Abstract - We provide an overview of current approaches to
DNA-based storage system design and accompanying synthesis, 
sequencing and editing methods. We also introduce and analyze 
a suite of new constrained coding schemes for both archival 
and random access DNA storage channels. The mathematical 
basis of our work is the construction and design of sequences 
over discrete alphabets that avoid pre-specified address patterns, 
have balanced base content, and exhibit other relevant substring 
constraints. These schemes adapt the stored signals to the DNA 
medium and thereby reduce the inherent error-rate of the system. 

1. Introduction 

Despite the many advances in traditional data recording
techniques, the surge of Big Data platforms and energy
conservation issues have posed new challenges to the storage
community in terms of identifying extremely high-volume,
non-volatile and durable recording media. The potential for
using macromolecules for ultra-dense storage was recognized
as early as the 1960s, when the celebrated physicist
Richard Feynman outlined his vision for nanotechnology in
the talk “There is plenty of room at the bottom.” Among
known macromolecules, DNA is unique in that it lends
itself to implementations of non-volatile recording media of
outstanding integrity (one can still recover the DNA of species
extinct for more than 10,000 years) and extremely high storage
capacity (a human cell, with a mass of roughly 3 picograms, hosts
DNA encoding 6.4 GB of information). Building upon the
rapid growth of biotechnology systems for DNA synthesis and
sequencing, two laboratories recently outlined architectures for
archival DNA-based storage. The first architecture
achieved a density of 700 TB/gram, while the second approach
raised the density to 2 PB/gram. The success of the latter
method was largely attributed to the use of three elementary
coding schemes: Huffman coding (a fixed-to-variable length
entropy coding/compression method), differential coding
(encoding the differences of consecutive symbols or the difference
between a sequence and a given template) and single
parity-check coding (encoding of a single symbol indicating the
parity of the string). More recent work extended the coding
approach of the latter architecture by replacing single
parity-check codes with Reed-Solomon codes.
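
As a rough illustration of two of the elementary schemes named above, the following Python sketch applies differential coding and a single parity-check symbol to a string over the four-letter alphabet. The mapping of bases to the integers 0-3, the modulo-4 arithmetic, and all function names are assumptions made for this example; they are not the exact constructions used in the cited architectures.

```python
BASES = "ACGT"  # assumed mapping for this sketch: A=0, C=1, G=2, T=3

def differential_encode(seq):
    """Replace each symbol by its difference (mod 4) from the previous one."""
    out, prev = [], 0  # the first symbol is differenced against 0 ('A')
    for ch in seq:
        cur = BASES.index(ch)
        out.append(BASES[(cur - prev) % 4])
        prev = cur
    return "".join(out)

def differential_decode(diff):
    """Invert differential_encode by accumulating the differences."""
    out, prev = [], 0
    for ch in diff:
        prev = (prev + BASES.index(ch)) % 4
        out.append(BASES[prev])
    return "".join(out)

def add_parity(seq):
    """Append one symbol so that the symbol values sum to 0 mod 4."""
    total = sum(BASES.index(ch) for ch in seq) % 4
    return seq + BASES[(-total) % 4]

def parity_ok(word):
    """Check the parity constraint; any single substitution violates it."""
    return sum(BASES.index(ch) for ch in word) % 4 == 0
```

A single parity symbol of this kind only detects a single substitution error; it cannot locate or correct it, which is one motivation for the stronger codes mentioned above.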
grant and the Strategic Research Initiative (SRI) Grant conferred by the
University of Illinois, Urbana-Champaign. 



Figure 1.1. Block Diagram of Prototypical DNA-Based Storage Systems. 
A classical information source is encoded (converted into ASCII or some 
specialized word format, potentially compressed, and represented over a 
four letter alphabet); subsequently, the strings over four-letter alphabets are 
encoded using standard and DNA-adapted constrained and/or error-control 
coding schemes. The DNA codewords are synthesized, with potential
undesired mutations (errors) added in the process, and stored. When possible,
rewriting is performed via classical DNA editing methods used in synthetic
biology. Sequencing is performed either through Sanger sequencing, if
short information blocks are accessed, or via High Throughput Sequencing 
(HTS) techniques, if large portions of the archive are selected for readout. 

All the aforementioned approaches have a number of
drawbacks, including the lack of partial access to data (i.e., one
has to reconstruct the whole sequence in order to read even one
base) and the unavailability of rewrite mechanisms. Moving
from a read-only to a random access, rewritable memory
requires a major paradigm shift in the implementation of the
DNA storage system, as one has to: append unique addresses
to constituent storage DNA blocks that will not lead to
erroneous cross-hybridization with the information encoded in
the blocks; avoid using overlapping DNA blocks for increased
coverage and subsequent synthesis, as they prevent efficient
rewriting; and ensure low synthesis (write) and sequencing (read)
error rates of the DNA blocks. To overcome these and other
issues, the authors recently proposed a (hybrid) DNA rewritable
storage architecture with random access capabilities. The
new DNA-based storage scheme encompasses a number of
coding features, including constrained coding, ensuring that
DNA patterns prone to sequencing errors are avoided; prefix
synchronized coding, ensuring that blocks of DNA may be
accurately accessed without perturbing other blocks in the
DNA pool; and low-density parity-check (LDPC) coding for
classically stored redundancy combating rewrite errors.
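
The kinds of constraints listed above can be made concrete with a small checker. The Python sketch below tests a candidate data block for balanced GC content, bounded homopolymer runs, and the absence of address strings as substrings. The tolerance `gc_tol`, the run limit `max_run`, and the function names are illustrative assumptions, not parameters of the proposed architecture; a fuller check would also screen reverse complements to address cross-hybridization. Blocks are assumed to be non-empty.

```python
def gc_fraction(block):
    """Fraction of G and C bases in a non-empty DNA string."""
    return sum(base in "GC" for base in block) / len(block)

def longest_homopolymer(block):
    """Length of the longest run of one repeated base."""
    longest = run = 1
    for prev, cur in zip(block, block[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest

def satisfies_constraints(block, addresses, gc_tol=0.1, max_run=3):
    """Check balanced content, bounded runs, and address avoidance."""
    balanced = abs(gc_fraction(block) - 0.5) <= gc_tol
    short_runs = longest_homopolymer(block) <= max_run
    avoids = all(addr not in block for addr in addresses)
    return balanced and short_runs and avoids
```

A checker of this form only filters candidate blocks; the constrained codes discussed later in the paper construct valid blocks directly instead of relying on rejection.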

The shared features of current DNA-based storage
architectures are depicted in Figure 1.1. The green circles denote the
source and media, while the blue circles denote processing
methods applied to the source and media. The processes of
Encoding and DNA Encoding add controlled redundancy into
the original source of digital information or into the DNA
blocks, respectively. This redundancy can be used to combat
synthesis (write) and sequencing (access and read) errors.
Synthesis is the biochemical process of creating physical
double-stranded DNA strings that reliably represent the
encoded data strings. Synthesis thereby also creates the storage
media itself, namely the DNA blocks. Storage refers to some means
of storing the DNA strings, and it represents a communication
channel that transfers information from one point in time to
another. In rewritable architectures, the Editing module
refers to the process of creating mutations in the stored DNA
strings (by deleting one or multiple substrings and potentially
inserting other strings), while the Reading module refers to
DNA sequencing that retrieves the content of selected DNA
storage blocks and subsequent decoding.

In order to understand how errors occur during the read
and write process, we start our exposition by describing
state-of-the-art synthesis, editing and sequencing methods
(Sections 2, 3 and 4). We then proceed to discuss how synthesis,
sequencing and editing methods are used in various
DNA-storage paradigms (Section 5), and the accompanying coding
techniques identified with different types of synthesis and
sequencing errors. New constrained coding techniques for
rewritable and random access systems, and their relationship to
classical codes for magnetic and optical storage, are described
in Section 6.

Given the semi-tutorial and interdisciplinary nature of this 
manuscript, we refer readers with a limited background in 
synthetic biology to the Appendix for a glossary of terms used
throughout the paper. 

2. DNA Sequence Synthesis 

De novo DNA synthesis is a powerful biotechnology that
enables the creation of DNA sequences without pre-existing
templates. Synthesis tools have a myriad of applications in
different research areas, ranging from traditional molecular
biology to emerging fields of synthetic biology,
nanotechnology and data storage. Broadly speaking, most technologies for
large-scale DNA synthesis rely on the assembly of pools of
oligonucleotide building blocks into increasingly larger DNA
fragments. The current high cost and small throughput of de
novo synthesis of these building blocks represent the main
limitation for widespread implementations of DNA synthesis
systems: as an example, oligo synthesis via
phosphoramidite column-based synthesis, described in subsequent
sections, may cost as much as $0.15 per nucleotide [11]. The
maximum length of the produced oligostrings lies in the range
100-200 nt. Hence, the synthesis of long DNA oligos
using dozens of building blocks can cost anywhere from
hundreds to thousands of US dollars. Therefore, it is imperative to
develop new, high-quality, robust, and scalable DNA synthesis
technologies that offer synthetic DNA at significantly more
affordable prices. This is in particular the case for massive
DNA-based storage systems, which may potentially require
billions of nucleotides.

Among the most promising synthesis technologies are the
so-called microarray-based synthesis methods: tens to hundreds
of thousands of oligos can be synthesized on a single
microarray, in conjunction with a decrease in reagent
consumption. For large-scale DNA synthesis projects, the
price of microarray-based synthesis is roughly $0.001 per
nucleotide [11], [12]. As in the case of phosphoramidite
column-based synthesis, the length of microarray-synthesized
oligos usually does not exceed 200 nt. However, oligos
synthesized on microarrays typically suffer from higher error rates
than those generated by phosphoramidite column methods.
Nevertheless, microarrays are the preferred synthesis tool for
generating customized DNA chips or for performing gene
synthesis. Many projects are underway to bridge the gap
between these two extremes, the high-cost, high-accuracy
and the low-cost, low-accuracy strategies, and hence reduce the
limitations of the corresponding methods [13], [14].
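
To put the two price points in perspective, the following Python sketch estimates the raw synthesis cost of storing a payload under each technology. The per-nucleotide prices are those quoted in the text; the ideal density of 2 bits per nucleotide, the absence of addressing and redundancy overhead, and the function names are simplifying assumptions of this example.

```python
# Per-nucleotide prices quoted in the text (US dollars).
COLUMN_PRICE_PER_NT = 0.15       # phosphoramidite column-based synthesis
MICROARRAY_PRICE_PER_NT = 0.001  # microarray-based synthesis

def oligos_needed(payload_bits, oligo_length, bits_per_nt=2.0):
    """Number of oligos required to hold payload_bits (ceiling division)."""
    bits_per_oligo = int(oligo_length * bits_per_nt)
    return -(-payload_bits // bits_per_oligo)

def synthesis_cost(num_oligos, oligo_length, price_per_nt):
    """Total price of synthesizing num_oligos strands of oligo_length nt."""
    return num_oligos * oligo_length * price_per_nt

if __name__ == "__main__":
    n = oligos_needed(8_000_000, 200)  # one megabyte in 200 nt oligos
    print(n, synthesis_cost(n, 200, COLUMN_PRICE_PER_NT),
          synthesis_cost(n, 200, MICROARRAY_PRICE_PER_NT))
```

Under these assumptions, one megabyte requires 20,000 oligos of 200 nt, or roughly $600,000 with column-based synthesis versus roughly $4,000 on microarrays, before any coding overhead is added.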

To provide a better understanding of the basic principles of 
DNA-based storage and the limitations that need to be
overcome in the writing process, we first describe different DNA
synthesis methods from nucleotides to larger DNA molecules. 
We then discuss recent techniques that aim to improve the 
quality and reliability of the synthesized sequences. 

A. Chemical Oligonucleotide Synthesis 

Chemical synthesis of single-stranded DNA originated more
than 60 years ago. Since the 1950s, when the first
oligonucleotides were synthesized [15]-[17], four different chemical
methods have been developed. These methods are named after
the major reagents used in the process: i)
H-phosphonate; ii) phosphodiester; iii) phosphotriester; and
iv) phosphite triester/phosphoramidite. A detailed description
of these methods may be found in [18], [19]; here we
only briefly review their advantages and disadvantages.

The H-phosphonate method derives its name from the use of
H-phosphonate nucleotides as building blocks. This approach
was later refined in [20], [21], where the H-phosphonate
chemistry was improved to
synthesize deoxyoligonucleotides on a solid support by using
different oligo coupling (stitching) agents that expedite the
reactions. The phosphodiester method was introduced in [17],
[22]. The main contribution of the method was the production
of protected dinucleotide monophosphates, which prevented
undesired elongation. Unfortunately, the approach also had
one major drawback: the linkages between nucleotides were
unprotected during the elongation step of the oligonucleotide
chain, which allowed for the creation of branched
oligonucleotides. The phosphotriester approach was also first
published in the 1950s [15] and later improved by Letsinger [23],
[24] and Reese [25] using different reagents to protect the
phosphate group in the internucleotide linkages. This approach
also prevented the formation of branched oligonucleotides.
Nevertheless, all the previously described methods and
variants thereof proved to be inefficient and time-consuming.

In the mid-seventies, a major advance in synthesis
technology was reported by Letsinger, solving in part a
number of problems associated with other existing methods.
His method was termed the phosphite triester approach.
The basic idea behind the approach was that the reagent
phosphorochloridite reacts with nucleotides faster than its
chloridate counterpart used in previous approaches. In addition
to expediting the underlying reactions, bifunctional
phosphorodichloridites unfortunately also produced undesirable side
products such as symmetric dimers. A modified method that
avoided these side products was developed by
Caruthers et al. [27]. The authors of [27] used a different
type of nucleoside phosphites that were more stable, reacted
faster, and produced higher yields of the desired dinucleoside
phosphite. The resulting method was named phosphoramidite
synthesis. Another important contribution was the
use of stable and easy-to-prepare phosphoramidites, which
facilitated the automation of oligo
synthesis in solid phase, making it the method of choice for
chemical synthesis.


B. Oligo Synthesis Platforms 

1) Column-Based Oligo Synthesis: The standard
phosphoramidite oligonucleotide synthesis operates via stepwise
addition of nucleotides to the growing chain, which is
immobilized on a solid support (Figure 2.1). Each addition cycle
consists of four chemical steps: i) de-blocking; ii) coupling
or condensation; iii) capping; and iv) oxidation. At the
beginning of the synthesis process, the first nucleotide, which
is attached to a solid substrate, is completely protected at
all of its active sites. Therefore, to make a reaction possible
and include a second nucleotide, it is necessary to remove
the dimethoxytrityl (DMT) protecting group from the
5’-end by addition of an acid solution. The removal of the
DMT group generates a reactive 5’-OH group (De-blocking
step). Subsequently, a coupling step is performed via
condensation of a newly activated DMT-protected nucleotide and
the unprotected 5’-OH group of the substrate-bound growing
oligostrand through the formation of a phosphite triester link
(Coupling or Condensation step). After the coupling step,
some unprotected 5’-OH groups may still exist and react in
later stages of nucleotide addition, leading to oligos with
deletion and bursty deletion errors. To mitigate this problem, a
capping reaction is performed by acetylation of the unreacted
5’-OH groups (Capping step). Finally, the unstable phosphite
triester linkage is oxidized to a more stable phosphate linkage
using an iodine solution (Oxidation step). The cycle is repeated
iteratively to obtain an oligonucleotide of the desired sequence
composition. At the end of the synthesis, the oligonucleotide
sequence is deprotected and cleaved from the support to
obtain a completely functional unit.
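
The interplay between coupling failures and capping described above can be illustrated with a toy Monte Carlo model. In the Python sketch below, each addition cycle couples with probability `coupling_eff`; a strand that fails to couple is capped (and stops growing) with probability `capping_eff`, and otherwise stays reactive and simply misses that base, which yields a deletion. The probabilities, the independence of cycles, and the function names are assumptions of this sketch; synthesis direction and side reactions are ignored.

```python
import random

def synthesize(target, coupling_eff, capping_eff, rng):
    """Simulate one strand; return (product, was_capped)."""
    product = []
    for base in target:
        if rng.random() < coupling_eff:
            product.append(base)            # coupling succeeded
        elif rng.random() < capping_eff:
            return "".join(product), True   # capped: strand stops growing
        # otherwise the failure went uncapped: the strand stays reactive
        # and this base is skipped, producing a deletion error
    return "".join(product), False

def simulate_pool(target, n_strands, coupling_eff, capping_eff, seed=0):
    """Count full-length, truncated (capped) and deletion-bearing strands."""
    rng = random.Random(seed)
    counts = {"full_length": 0, "truncated": 0, "deletion": 0}
    for _ in range(n_strands):
        product, capped = synthesize(target, coupling_eff, capping_eff, rng)
        if capped:
            counts["truncated"] += 1
        elif product == target:
            counts["full_length"] += 1
        else:
            counts["deletion"] += 1
    return counts
```

With perfect capping the model produces only truncated by-products, which are the easier ones to remove by length-based purification; with imperfect capping it produces full-length-like strands carrying internal deletions.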


2) Array-Based Oligo Synthesis: In the 1990s, Affymetrix
developed a method for chemical synthesis of different polymers
combining photolabile protecting groups and
photolithography [29], [30]. The Affymetrix solution uses a
photolithographic mask to direct UV light in a targeted manner, so
as to selectively deprotect and activate 5’ hydroxyl groups
of nucleotides that should react with the nucleotide to be
incorporated in the next step. The mask is designed to expose
specific sites on the microarray to which new nucleotides will
be added, with other sites being masked. Once synthesis is
completed, the oligos are released from the array support and
recovered as a complex mixture (pool) of sequences.
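
The role of the masks can be sketched in a few lines of Python. Assuming that bases are offered to the whole array in a fixed cyclic order, the function below computes, for each step, which sites must be exposed so that every site ends up with its intended oligo. The cyclic flow order, the left-to-right growth of each oligo, and the function name are assumptions of this illustration and not a description of any vendor's protocol.

```python
def mask_schedule(oligos, flow_order="ACGT"):
    """Return a list of (base, exposed_sites) steps that builds every oligo."""
    if any(base not in flow_order for oligo in oligos for base in oligo):
        raise ValueError("every base must appear in the flow order")
    progress = [0] * len(oligos)  # next position to synthesize at each site
    steps, step = [], 0
    while any(p < len(o) for p, o in zip(progress, oligos)):
        base = flow_order[step % len(flow_order)]
        # expose exactly the sites whose next required base is offered now
        exposed = [site for site, (p, o) in enumerate(zip(progress, oligos))
                   if p < len(o) and o[p] == base]
        for site in exposed:
            progress[site] += 1
        steps.append((base, exposed))
        step += 1
    return steps
```

In this model, a masking error at one step shifts or drops a base at a specific site, which is one way to see why error patterns can depend on the location of the synthesis spot.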

A number of other, related methods have been developed for
the purpose of synthesizing oligostrands on microarrays.
For instance, the method developed by Agilent uses
ink-jet-based printing, where picoliters of
each incorporated nucleotide and activator can be spotted
(deposited) with high precision at specific sites on an array.
This ink-jet method removes the need for photolithography
masks [32]. In
an alternative method commercialized by NimbleGen Systems,
the photolithography masks are superseded by a virtual mask
that is combined with digital programmable mirrors to activate
specific locations on the array [33], [34]. CustomArray (formerly
CombiMatrix) developed a technology in which thousands of
microelectrodes control acid production by an electrochemical
reaction to deprotect the growing oligo at a desired spot [35].
In addition, oligo synthesis has been implemented within a
multichamber microfluidic device coupled to a digital optical device
that uses light to produce acid in the chambers [36]. Masking
and printing errors may introduce substitution,
insertion and deletion errors, and when multiple sequences are
synthesized simultaneously, the error patterns within different
sequences may be correlated, depending on the location of
their synthesis spots.

Both solid-phase and microarray technologies exhibit a
number of challenges that need to be overcome to reduce error
rates and increase throughput. Side reactions such as
depurination [37], [38] and reaction inefficiencies during the stepwise
addition of nucleotides reduce the desired yield, and
generate errors in the sequence, especially when synthesizing
long oligostrands. In particular, these processing problems
introduce substitution, insertion and deletion errors.
Thus, a purification step is usually necessary to identify and
discard undesirable erroneous sequences. High-performance
liquid chromatography and polyacrylamide gel electrophoresis
can be used to eliminate truncated products, but both are
expensive and time-consuming, and single insertions and
deletions or substitution errors in the sequence often cannot be
removed. Nevertheless, the fidelity can be increased by
optimizing the chemical reactions and conditions.

3) Complex Strand and Gene Synthesis: Traditionally,
to generate DNA fragments several hundred nucleotides long,
a set of shorter oligostrands is fused together
using either ligation-based or polymerase-based
reactions. Ligation-based approaches usually rely on
thermostable DNA ligases that ligate phosphorylated overlapping
oligos in high-stringency conditions [39]. In polymerase-based
approaches (polymerase cycling assembly, PCA), oligos with
Figure 2.1. Main Steps of the Column-Based Oligo Synthesis Process of Section 2-B. The first step in the DNA synthesis cycle is the deprotection of the
support-bound nucleoside at the 5’ terminal end (1, highlighted in blue) by removal of the DMTr group. This step yields a nucleoside with a 5’ OH group
(2, highlighted in red). During the coupling step, an activated nucleoside (3) reacts with the 5’ OH group of the support-bound nucleoside (2), generating a
dinucleotide phosphoramidite (4) (formation of a phosphite triester, highlighted in green-blue). In the capping step, unreacted 5’ OH groups are blocked by acetylation
(5, highlighted in green) to prevent further chain extension. In the last step of the cycle, the unstable phosphite triester (in green-blue) is oxidized to a phosphate
linkage (6, highlighted in purple), which is more stable in the chemical conditions of the following synthesis steps. The cycle is repeated for each nucleoside
addition. After the last step of the synthesis of the entire oligonucleotide, the final product needs to be cleaved from the solid support and deprotected at the 5’
terminal end. In red is the pentose of the nucleoside, in blue the dimethoxytrityl (DMTr) protecting group, in green-blue the 2-cyanoethyl phosphoramidite group,
and in purple the phosphate group. Grey spheres represent the solid support to which the growing oligo is attached. Circles highlight the group that is modified
in each step.


overlapping regions are used to generate progressively longer
double-stranded sequences [40]. After assembly, synthesized
sequences need to be PCR amplified, cloned, and verified, thus
increasing the cost of production. Another approach, developed
by Gibson et al., exploits yeast in vivo recombination to
assemble a set of more than 30 oligos together with a plasmid,
all in one step. The same group also synthesized the mouse
mitochondrial genome from 600 overlapping oligos using an
isothermal assembly method [42].
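
At the level of strings, assembly from overlapping oligos amounts to merging consecutive fragments through their shared end regions. The Python sketch below is a deliberately simplified model: all oligos are assumed to be given in order and on the same strand, whereas laboratory protocols such as PCA use alternating complementary strands; the minimum overlap and the function names are assumptions of this example.

```python
def overlap_len(left, right, min_overlap):
    """Longest suffix of left that equals a prefix of right (0 if too short)."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0

def assemble(oligos, min_overlap=4):
    """Join oligos, given in order, through their overlapping end regions."""
    construct = oligos[0]
    for oligo in oligos[1:]:
        k = overlap_len(construct, oligo, min_overlap)
        if k == 0:
            raise ValueError("adjacent oligos do not overlap")
        construct += oligo[k:]
    return construct
```

If a short overlap also occurs elsewhere in the pool, fragments can join in the wrong order, which is the string-level analogue of the cross-hybridization problems that arise in large pools.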

Although microarray synthesis reduces the price of
oligonucleotides, there are two major challenges that still hamper
its use. First, hundreds of thousands of oligonucleotides can
be made on a single microarray, but each oligo is produced
in very small amounts. Second, the oligostrands are cleaved
from the array all at once as a large heterogeneous pool,
which subsequently leads to difficulties in sequence assembly
and to cross-hybridization. A number of strategies have been
recently developed to solve these problems. For example, PCR
amplification increases the concentration of the oligos before
assembly and, combined with hybridization selection, reduces
the incorporation of oligonucleotides containing undesirable
synthesis errors [43]. A modification of this approach, based
on hybridization selection embedded in the assembly process
and coupled with the optimization of oligo design and
assembly conditions, was reported in [44]. Still, large pools of
oligos (>10,000) increase difficulties in sequence assembly.
Two different strategies have been described where subpools
of oligos involved in a particular assembly were isolated, thus
partially avoiding cross-hybridization. Kosuri et al. [45] used
predesigned barcodes to amplify subpools of oligos, and in



Figure 2.2. Rewriting (Deletion and Insertion Edits) via gBlocks. This 
method is used when edits of relatively short length are required, as it is cost 
efficient and simple. Primers corresponding to unique contexts in the encoded 
DNA are used to access the edit region, which is subsequently cleaved and 
replaced by the gBlock. 


a second step removed the barcodes by digestion. In another 
approach, the microarray was physically divided in sub-arrays 
that enabled performing amplification and assembly separately
in each microwell.

4) Error Correction: Despite having elaborate biochemical
error removal processes in place, some residual errors tend
to remain in the synthesized pool, and additional errors arise
during the assembly phase. A number of error-correction
strategies have been reported in the literature.
Many of the current error-removal techniques rely on
DNA mismatch recognition proteins. Denaturation and
re-hybridization steps lead to double-stranded DNA with
mismatches between erroneous bases and the corresponding
correct bases. The disrupted sites are recognized and/or cleaved
Figure 2.3. Rewriting (Deletion and Insertion Edits) via OEPCR. OEPCR allows for incorporating customized sequence changes via primers used in
amplification reactions. As the primers have terminal complementarity, two separate DNA fragments may be amplified and fused into a single sequence 
without using restriction endonuclease sites. Overlapping fragments are fused together in an extension reaction and PCR amplified. 


by mismatch recognition proteins. MutS is a protein that
binds unpaired bases and small DNA loops (i.e., small
unmatched substrings in DNA that protrude from the double
helix). After denaturation and re-hybridization, MutS detects
and binds the mismatched regions, which are later removed by
gel electrophoresis. This strategy reduces the error rate to 1
nucleotide per 10 Kb [47]. “Consensus shuffling” is a variation
of the MutS method where mismatch-containing pieces are
captured by column-immobilized MutS proteins, and error-free
fragments are eluted [48]. In other variations of this method,
two homologs of MutS immobilized in cellulose columns can
reduce the error rate to 0.6 nucleotides per Kb at a very low
cost. On the other hand, in the MutHLS approach, MutS
binds unpaired bases, while the protein MutL links the MutH
endonuclease to the MutS-bound sites, where it cleaves the erroneous
heteroduplexes. The correct sequences are recovered by gel
electrophoresis [50]. Similarly, resolvases [51] and
single-strand nucleases [52], [53] may also be used to recognize and
cleave mismatched sites in DNA heteroduplexes. It is worth
pointing out that CEL endonuclease, its commercial version
Surveyor™ nuclease (Transgenomic, Inc.), or a commercial
CEL-based enzymatic cocktail, ErrASE, which recognizes and
nicks at base-substitution mismatches, is commonly used in
practice due to its broad substrate specificity; it can reduce the
error rate to as low as 1 nucleotide per 9.6 Kb [54], [55].

The introduction of Next Generation Sequencing (NGS)
platforms as high-throughput purification methods opened new
possibilities for error-free DNA synthesis. Matzas et al. [56]
combined a next-generation pyrosequencing platform with a
robotic system to image and pick beads containing
sequence-verified oligonucleotides. The estimated error rate using this
approach is 1 nucleotide error per 21 Kb. One limitation of
this method is that the “pick-and-place” recovery system is
not accurate enough, due to the small size of clonal beads, to
satisfy the increasing demand for long DNA strands
(involving 104 building blocks) [57]. A new NGS-based
method was recently announced, where specific barcoded
primers were used to amplify only those oligos with the correct
sequence [58], [59]. Similarly, a new method termed “sniper
cloning” has been reported in [57]. There, NGS platform beads
containing sequence-verified oligonucleotides are recovered
by “shooting” a laser pulse. This laser technology enables
cost-effective, high-throughput selective separation of correct
fragments without cross-contamination.
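
The residual error rates quoted above can be compared through the probability that a synthesized strand contains no error at all. The Python sketch below assumes independent errors at each nucleotide and reads "Kb" as 1,000 nucleotides; both are simplifying assumptions, and the dictionary labels and function names are introduced only for this example.

```python
# Residual error rates quoted in the text, converted to errors per nucleotide.
RATES = {
    "MutS (1 per 10 Kb)": 1 / 10_000,
    "immobilized MutS homologs (0.6 per Kb)": 0.6 / 1_000,
    "CEL-based nuclease (1 per 9.6 Kb)": 1 / 9_600,
    "NGS bead picking (1 per 21 Kb)": 1 / 21_000,
}

def error_free_probability(length_nt, errors_per_nt):
    """Probability that a strand of length_nt contains no error,
    assuming independent per-nucleotide errors."""
    return (1.0 - errors_per_nt) ** length_nt

def compare(length_nt):
    """Map each method to its error-free probability at length_nt."""
    return {name: error_free_probability(length_nt, rate)
            for name, rate in RATES.items()}
```

For a 1,000 nt construct, a rate of 1 error per 10 Kb leaves roughly 90% of strands error-free under this model, which gives a sense of how much residual error the coding layer of a storage system has to absorb.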

As a parting note, we observe that even single substitution
errors in the synthesis process may be detrimental for
applications in biological and medical research. This is not
the case for DNA-based storage systems, where the DNA
strands are used as storage media which may have a
non-negligible error rate. Synthesis errors may be easily combated
through the introduction of carefully designed parity-checks 
of the information strings, as will be discussed in subsequent 
sections. 

3. DNA Editing 

Once desired information is stored in DNA by
synthesizing properly encoded heteroduplexes, it may be rewritten
using classical DNA editing techniques. DNA editing is the
process of adding very specific point mutations (often with
the precision of a few nucleotides) or deleting and inserting
DNA substrings at tightly controlled locations. In the latter
case, one needs to synthesize readily usable short-to-medium
length DNA fragments. For this purpose, two techniques are
commonly used: gBlocks Gene Fragments [60] (see Integrated
DNA Technologies) as building blocks for insertion and
deletion edits, and Overlap-Extension PCR (OEPCR) as a
means of adding the mutated blocks.

gBlocks are double-stranded, precisely content-controlled
DNA blocks that may be used for applications as diverse as
gene construction, PCR and qPCR control, recombinant
antibody research, protein engineering, CRISPR-mediated genome
editing and general medical research. They are usually
constructed at very low cost (a fraction of a dollar) using gene
fragment libraries, i.e., pools of short DNA strings that
contain up to 18 consecutive bases of type N (any nucleotide)
or K (Keto). The libraries and library products are carefully
tested for correct length via capillary electrophoresis and for
sequence composition via mass spectrometry; consensus protocols are
used in the final verification stage to reduce any potential
errors. This last stage, together with additional quality control
testing, ensures that at least 80% of the generated pool contains the
desired string. For strings with complex secondary structure,
this percentage may be significantly lower. This calls for
controlling the secondary structure of the products whenever
the application allows for it. Such is the case for DNA-based
storage, and methods for designing DNA codewords with no
secondary structure (predicted to the best extent possible via
combinatorial techniques) were described in [63].

DNA substring editing is frequently performed via
specialized PCR reactions. Of particular use in DNA rewriting is the
process of OEPCR, illustrated in Figure 2.3. In OEPCR,
one uses two primers to flank the two ends of the string to be
edited. For fragment deletion (splicing), the flanking primers
act like zippers that need to join over the segment to be
spliced out. Furthermore, the primer at the end to be joined is
designed so that it has an overhanging part complementary
to the overhanging part of the other primer. Via controlled
hybridization, the DNA strands are augmented by a DNA
insert that is also complementary to the underlying DNA
strand. Upon completion of this extension, classical PCR
amplification is performed for the elongated sequence primers and
the inserted overlapping fragments of the sequences are fused.
Note that this method does not require restriction sites or
enzymes. OEPCR is mostly used to insert oligonucleotides of
lengths longer than 100 nucleotides. In OEPCR, the sequence
being modified is used to make two modified strands with the
mutation at opposite ends, using the method outlined above.
After denaturation, the strands are mixed, leading to different
hybridization products. Of all the products, only one will allow
for polymerase extension via the introduction of a primer:
the heterodimer without overlap at the 5’ end. The duplex
created by the polymerase is denatured once again and another
primer is hybridized to the created DNA strand, introducing
a sequence contained in the first primer. DNA replication
consequently results in an extended sequence containing the
desired insert.
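
Viewed purely as a string operation, both editing techniques replace the segment between two uniquely identifiable flanking contexts. The Python sketch below models only this outcome, not the biochemistry; the requirement that each context occurs exactly once mirrors the need for unique primer binding sites, and the function names are assumptions of this example.

```python
def occurrences(strand, pattern):
    """Count occurrences of pattern in strand, including overlapping ones."""
    return sum(strand.startswith(pattern, i)
               for i in range(len(strand) - len(pattern) + 1))

def rewrite_block(strand, left_context, right_context, new_segment):
    """Replace the substring between two unique flanking contexts."""
    if occurrences(strand, left_context) != 1:
        raise ValueError("left context must occur exactly once")
    if occurrences(strand, right_context) != 1:
        raise ValueError("right context must occur exactly once")
    start = strand.index(left_context) + len(left_context)
    end = strand.index(right_context)
    if end < start:
        raise ValueError("right context must follow the left context")
    return strand[:start] + new_segment + strand[end:]
```

Passing an empty `new_segment` models a pure deletion (splicing), while a non-empty one models a deletion followed by an insertion.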


4. DNA Sequencing 

The goal of DNA sequencing is to read the DNA content, 
i.e., to determine the exact nucleotides and their order in a 
DNA molecule. Such information is critical in understanding 
both basic biology and human diseases as well as for
developing nature-inspired computational platforms.

Sanger et al. [64] first developed a method for
sequencing DNA based on chain termination (see the accompanying
figure for an illustration). This technique, which is commonly referred
to as Sanger sequencing, has been widely used for several
decades and is still being used routinely in numerous
laboratories. The automated and parallelized approaches of
Sanger sequencing directly led to the success of the Human
Genome Project [65] and the genome sequencing projects
of other important model organisms for biomedical research
(e.g., mouse [66]). The availability of these entire genomes
has provided scientists with unprecedented opportunities to
make novel discoveries about genome architecture and genome
function, the trajectory of genome evolution, and the molecular bases
of phenotypic variation and disease mechanisms.

However, in the past decade, the development of faster, 
cheaper, and higher-throughput sequencing technologies has 
dramatically expanded the reach of genomic studies. These 
“next-generation sequencing” (NGS) technologies, as opposed
to Sanger sequencing, which is considered first-generation,
have been among the most disruptive modern technological
advances. In general, the NGS technologies have several
major differences when compared to Sanger sequencing. First, 
electrophoresis is no longer needed for reading the sequencing 
output (i.e., substring lengths) which is now typically detected 
directly. Second, more straightforward library preparations 
that do not use DNA clones have become a critical part 
of sequencing workflow. Third, a tremendously large number
of sequencing reactions are run in parallel with
ultra-high throughput. A demonstration of the significant NGS
technology development is the cost reduction. Around the
year 2001, the cost of sequencing a million base-pairs was 
about $5,000, whereas it was only about $0.05 in mid-2015
(http://www.genome.gov/sequencingcosts/). In other words, it
costs less than $5,000 to sequence an entire human genome
with 30x coverage. This cost keeps dropping every few months
due to new developments in sequencing technology. However, 
a clear shortcoming of NGS versus Sanger technologies has 
been data quality. The read lengths are much shorter and the 
error rate is higher as compared to Sanger sequencing. For 
instance, the read length from Illumina sequencing platforms 
ranges from 50bps to 300bps, making subsequent genome 
assembly extremely difficult, especially for genomes with a 
large proportion of repetitive elements/substrings. The error
rates of the latest Illumina sequencing platforms, such as the HiSeq
2500, are less than 1%, and the errors are highly non-uniformly
distributed along the sequenced reads: the terminal 20% of
nucleotides have orders of magnitude higher error rates than
the initial 80% of bases.
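
As a rough check of the cost figures quoted above, the per-genome estimate can be reproduced with a few lines of arithmetic. This is only a sketch: the haploid genome size of 3.2 billion base-pairs is an assumption that is not stated in the text, and the function name is illustrative.

```python
# Back-of-the-envelope check of the sequencing cost quoted in the text.
COST_PER_MEGABASE_USD = 0.05   # mid-2015 figure from the text
GENOME_SIZE_BP = 3.2e9         # ASSUMPTION: approximate haploid human genome size
COVERAGE = 30                  # 30x coverage, as in the text


def genome_cost_usd(genome_bp, coverage, cost_per_megabase):
    """Cost of sequencing genome_bp bases at the given coverage."""
    sequenced_megabases = genome_bp * coverage / 1e6
    return sequenced_megabases * cost_per_megabase


cost = genome_cost_usd(GENOME_SIZE_BP, COVERAGE, COST_PER_MEGABASE_USD)
print(f"Estimated cost: ${cost:,.0f}")  # below the $5,000 bound in the text
```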

The first NGS platform was introduced by 454 Life Sciences
(acquired by Roche in 2007). Although Roche will shut down
454 in 2016, 454 platforms have made significant contributions
to both NGS technology development and biological
applications, including the first full genome of a human individual
sequenced using NGS. The 454 platform utilizes pyrosequencing.
Briefly, pyrosequencing operates as follows. DNA samples are
first fragmented randomly. Then each fragment is attached
to a bead and emulsion PCR is used to make each bead
contain many copies of the initial fragment. The
sequencing machine contains numerous picoliter-volume wells, each
containing a bead. In pyrosequencing, luciferase is used to
produce light, initiated by pyrophosphate released when a nucleotide
is incorporated at each cycle during sequencing. One drawback
of 454 sequencing is that multiple incorporation events occur
in homopolymers. Therefore, as the length of a homopolymer
is reflected by the light intensity, a number of sequencing
errors arise in connection with homopolymers. We remark that
such errors were accounted for in a number of DNA-storage
implementations, even those using other sequencing platforms
which typically do not introduce homopolymer errors.

Figure 3.1. Main steps of the Sanger sequencing protocol. In the first step, a pool of DNA fragments is sequenced via synthesis. Synthesis terminates
whenever chemically inactive versions of nucleotides (dd*TP) are incorporated into the growing chains. These inactive nucleotides are fluorescently labeled
to uniquely determine their bases. In the second step, the fragments are sorted by length using capillary gel methods. The terminal step involves reading the
last bases in the fragments using laser systems.

The SOLiD platform, developed by Applied Biosystems 
(merged with Invitrogen to become Life Technologies in 
2008), was introduced in 2007. SOLiD uses sequencing by 
ligation; i.e., unlike 454, DNA ligase is used instead of 
polymerase to identify nucleotides. During sequencing, a pool 
of possible oligonucleotides of a certain length are labeled 
according to the sequenced position. These oligonucleotides 
are ligated by DNA ligase for matching sequences. Before 
sequencing, the DNA is amplified using emulsion PCR. Each 
of the resulting beads contains single copies of the same DNA 
molecule. The output of SOLiD is in color space format, an 
encoded form of the nucleotide sequences with four colors 
representing 16 combinations of two adjacent bases. 
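
The color space representation can be illustrated with a short sketch. It uses the commonly described two-base rule in which bases are numbered A=0, C=1, G=2, T=3 and the color of an adjacent pair is the XOR of the two numbers, so that each of the four colors covers four of the 16 pairs. The function names are illustrative and not part of any SOLiD software.

```python
# SOLiD-style color space: each color encodes a pair of adjacent bases.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = {value: base for base, value in BASE_TO_BITS.items()}


def to_color_space(seq):
    """Return the first base and the list of colors of adjacent base pairs."""
    colors = [BASE_TO_BITS[x] ^ BASE_TO_BITS[y] for x, y in zip(seq, seq[1:])]
    return seq[0], colors


def from_color_space(first_base, colors):
    """Invert to_color_space, given the known first base."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)


first, colors = to_color_space("ATGGCA")
print(first, colors)
print(from_color_space(first, colors))
```

Because every base is decoded relative to its predecessor, a single wrong color changes all subsequent bases, which is why color space reads are usually analyzed in color space rather than translated directly.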

The most frequently used sequencing platform so far has
been Illumina. Its sequencing technology was developed by
Solexa, which was acquired by Illumina in 2007. The method
is mainly based on reversible dye-terminators that allow the
identification of nucleotide bases when they are introduced
into DNA strands. DNA samples are first randomly fragmented
and primers are ligated to both ends of the fragments. The fragments are
then attached to the surface of the flow cell and amplified (in
a process known as bridge amplification)
so that local clonal DNA colonies, called “DNA clusters”, are
created. To determine each nucleotide base in the fragments,
sequencing by synthesis is utilized. A camera takes images of
the fluorescently labeled nucleotides to enable base calling.
Subsequently, the dye, along with the terminal 3’ blocker,
is removed from the DNA to allow the next cycle to
begin; this process is repeated over multiple iterations. The most frequently encountered
errors in Illumina data are simple substitution errors. Much
less common are deletion and insertion errors, and there is an
indication that sequencing error rates are higher in regions
containing homopolymers of length exceeding 15-20 [68].
Substitution errors arise when nucleotides are incorporated
at different positions in the fragments of a cluster during
the same cycle. They are also caused by clusters formed from more
than one DNA fragment, resulting in mixed signals during
the base calling step. Illumina sequencers have been used
in numerous NGS applications, including whole-genome
sequencing, whole-exome sequencing, RNA sequencing,
ChIP sequencing and others. The Illumina HiSeq 2500 system
can generate up to 2 billion single-end reads (in 250 bp) per
flow cell with 8 lanes. The recently announced HiSeq 4000
system can produce up to 5 billion single-end reads per flow
cell.

In addition, several other types of sequencing technologies
have been developed in recent years, with the Pacific
Biosciences (PacBio) single-molecule real time (SMRT)
technology and Oxford Nanopore’s nanopore sequencing systems
being the most promising ones. In SMRT, no amplification
is needed and the sequencer observes the enzymatic reaction in
real time. It is sometimes referred to as “third-generation
sequencing” because it does not require any amplification
prior to sequencing. The most significant advantage of PacBio
data is the much longer read length as compared to other
NGS technologies. SMRT can achieve read lengths exceeding
10 Kbases, making it more desirable for finishing genome
assemblies. Another advantage is speed: run times are much
shorter. However, the cost of PacBio sequencing is fairly high,
amounting to a few dollars per million base-pairs.
Furthermore, SMRT error rates are significantly higher than those of
Illumina sequencers and the throughput is much lower as well.

Oxford Nanopore is considered another third-generation
technology. Its approach is based on reading out electrical
signals as a single-stranded DNA sequence passes through
a nanoscale hole made from proteins or synthetic materials.
The DNA passing through the nanopore changes the
ion current through the pore, allowing the sequencing process to recognize
nucleotide bases. Oxford Nanopore has developed a
hand-held device called MinION, which has been available to early
users. MinION can generate more than 150 million bases per
run. However, the error rate is significantly higher than that of other
technologies and it is still being improved. Some of the errors
were identified in |TQ| as asymmetric errors, caused by two 
bases creating highly similar current impulse responses. 

Significant challenges of NGS still remain, in particular data
analysis problems arising due to short read length. One major
step after obtaining the sequencing reads is to assemble them
into longer DNA fragments. Most assemblers follow a
multi-stage procedure: correcting raw read errors, constructing
contigs (i.e., contiguous sequences obtained via overlapping
reads), resolving repeats, and connecting contigs into scaffolds
using paired-end reads. Most de novo assemblers utilize the de
Bruijn graph (DBG) data structure to represent the large number
of input short reads. EULER [69] pioneered the use of DBGs
in genome assembly. In recent years, several NGS assemblers
(such as Velvet [70], ALLPATHS-LG [71], SOAPdenovo [72],
ABySS [73], and SGA [74]) have shown promising performance.
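
To make the de Bruijn graph representation concrete, the following minimal sketch builds such a graph from a handful of reads. It is not the data structure of any particular assembler named above: nodes are (k-1)-mers, an edge is added for every k-mer observed in a read, and edge multiplicities and error correction are ignored.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix of the k-mer
    return dict(graph)


reads = ["ACGT", "CGTA"]
for node, successors in sorted(de_bruijn_graph(reads, k=3).items()):
    print(node, "->", sorted(successors))
```

Contigs then correspond to unbranched paths in the graph; repeats show up as nodes with several incoming or outgoing edges.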

5. Archival DNA-Based Storage 
A. The Church-Gao-Kosuri Implementation 

The first large-scale archival DNA-based storage
architecture was implemented and described in the seminal paper
of Church et al. [1]. In the proposed approach, user data
was converted to a DNA sequence via a symbol-by-symbol
mapping, encoding each data bit 0 into A or C, and each data
bit 1 into T or G. Which of the two bases is used for encoding a
particular bit is determined by a runlength constraint, i.e., one
base is chosen randomly as long as the choice does not create a homopolymer
run of length greater than three. Furthermore, the choice of
one of the two bases enables control of the GC content and
secondary structure within the DNA data blocks.
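
The bit-to-base mapping described above can be sketched as follows. This is a simplified illustration and not the authors' implementation: it enforces only the runlength constraint (no homopolymer run longer than three) and ignores GC content and secondary structure; the function names are hypothetical.

```python
import random

ZERO_BASES = "AC"   # bit 0 -> A or C
ONE_BASES = "TG"    # bit 1 -> T or G
MAX_RUN = 3         # homopolymer runs longer than this are disallowed


def encode_bits(bits, rng=random):
    """Map a bit string to DNA, choosing bases at random subject to MAX_RUN."""
    dna = []
    for bit in bits:
        candidates = list(ONE_BASES if bit == "1" else ZERO_BASES)
        # If the last MAX_RUN bases are identical, exclude that base.
        if len(dna) >= MAX_RUN and len(set(dna[-MAX_RUN:])) == 1:
            candidates = [base for base in candidates if base != dna[-1]]
        dna.append(rng.choice(candidates))
    return "".join(dna)


def decode_bases(dna):
    """Invert encode_bits: A/C -> 0 and T/G -> 1."""
    return "".join("0" if base in ZERO_BASES else "1" for base in dna)


# 8-bit ASCII representation of a 12-character text (96 bits).
bits = "".join(f"{byte:08b}" for byte in "ferential DN".encode("ascii"))
dna = encode_bits(bits, random.Random(0))
print(dna)
print(decode_bases(dna) == bits)
```

Since each bit has two candidate bases and at most one of them can extend the current run, a valid choice always exists.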

To illustrate the feasibility of their approach, Church et al.
encoded in DNA an HTML file of size 5.27 Mb. The file
included 53,426 words, 11 JPG images and one JavaScript
file. In order to eliminate the need for long synthetic DNA
strands that are hard to assemble, the file was converted into
54,898 blocks of length 159 nucleotides. Each block
contained 96 information nucleotides, 19 nucleotides for
addressing, and a common sequence of 22 nucleotides at each end
used for amplification and sequencing. The 19-nucleotide
addresses corresponded to binary encodings of consecutive
integers, starting from 00 ... 001.

The oligonucleotide library was synthesized using
inkjet-printed, high-fidelity DNA microchips, described in the
earlier discussion of DNA synthesis. To read the data, the library was first amplified
by limited-cycle PCR, and then sequenced on a single lane of
an Illumina HiSeq system, as described in the preceding section. Because
synthesis and sequencing errors occurred with low frequency,
the DNA blocks were correctly decoded using their own
encodings and decoded copies of overlapping blocks. As a
result, only 10 bit errors were observed within the 5.27 million
encoded bits, i.e., the reported system error rate was less than
$2 \times 10^{-6}$.

The architecture of the Church-Gao-Kosuri DNA-based
encoding system is illustrated in Figure 5.1.

Encoding example: We provide next an example of the
encoding algorithm proposed by Church, Gao and Kosuri [1]. The
text of choice is “ferential DN”.

• First, each symbol is converted into its 8-bit ASCII
format. The encoding results in a binary string of length
12 × 8 = 96 of the following form:

f        e        r        e        n        t
01100110 01100101 01110010 01100101 01101110 01110100

i        a        l        (space)  D        N
01101001 01100001 01101100 00100000 01000100 01001110

• Second, a unique 19-bit barcode is prepended to
the binary string for the purpose of DNA block
identification: here, we assume that the barcode is
1000110111000110100. This results in a binary string of
length 19 + 96 = 115, namely (barcode first):

1000110111000110100011001100110010101110010011 

0010101101110011101000110100101100001011011000 

01000000100010001001110. 

• Third, every bit 0 is converted into A or C and every bit 1 
into T or G. This conversion is performed randomly, while 
disallowing homopolymer runs of length greater than 
three. The scheme also asks for balancing the GC content 
and controlling the secondary structure. For instance, the 
following DNA code generated from the example binary 
text satisfies all the aforementioned conditions: 

TAACGTCTTGCCCGGAGAAATGAATTCATTCATATATGTCAGAA 

TTCATAGCGGATGTAATGTCTACGTCTCATAGGCCCATAGTCTG 

CCACTACACCATACATAACTCCGTTA. 

• Finally, two primers of length 22 nt are added to
both ends of the DNA block. The forward primer is
CTACACGACGCTCTTCCGATCT, while the backward primer
is AGATCGGAAGAGCGGTTCAGCA, whose first 14 bases agree with
the reverse complement of the forward primer. Hence, the encoded DNA
codeword is of length 22 + 115 + 22 = 159 nt, and reads
as:

forward 

CTACACGACGCTCTTCCGATCTTAACGTCTTGCCCGGAGAAATG 

AATTCATTCATATATGTCAGAATTCATAGCGGATGTAATGTCTA 

CGTCTCATAGGCCCATAGTCTGCCACTACACCATACATAACTCC 

GTTAAGATCGGAAGAGCGGTTCAGCA. 

backward 

B. The Goldman et al. Method 

To encode the digital information into a DNA sequence,
Goldman et al. started with a binary data set. The binary
file representation was obtained via ASCII encoding, using
one byte per symbol (Step A). Each byte was subsequently
converted into 5 or 6 trits via an optimal Huffman code for
the underlying distribution of the particular dataset used. The
compressed file comprised $5.2 \times 10^6$ information bits (Step B).
Each trit was then used to select one out of the three
nucleotides differing from the last encoded nucleotide. This
form of differential coding ensures that there are no
homopolymer runs of length greater than one (Step C). Finally, the
resulting DNA string was partitioned into segments of length
100 nucleotides, each of which overlaps
in 75 bases with each adjacent segment (Step D).
This overlap ensures 4x coverage for each base. In addition,
alternate segments of length 100 were reverse complemented.
Indexing information was then appended to each segment: 2 trits for file identification,
12 trits for intra-file location information (which can be used to
encode up to $3^{12}$ unique segment locations), and one parity-check
trit. Finally, one additional base was added at each end to indicate
whether the entire fragment was reverse complemented or not.
The resulting encoding
amounted to 153,335 strings of length 117 nucleotides.

Figure 5.1. A chosen text file is converted to ASCII format using 8 bits for each symbol. Blocks of bits are subsequently encoded into DNA using a 1
bit-per-nucleotide encoding. The entire 5.27 Mb HTML file amounted to 54,898 oligonucleotides and was synthesized and eluted from a DNA microchip.
After amplification (common primer sequences of the blocks are not shown) the library was sequenced using an Illumina platform. Individual reads with
the correct barcode and length were screened for consensus, and then converted back into bits comprising the original file.

As an experiment, Goldman et al. [2] encoded a digital data
file of size 739 KB with an estimated Shannon information
of $5.2 \times 10^6$ bits into DNA. Their file included all 154 of
Shakespeare’s sonnets (ASCII text), a classic scientific paper
(PDF format), a medium-resolution color photograph of the
European Bioinformatics Institute (JPEG 2000 format), and a
26-s excerpt from Martin Luther King’s 1963 ‘I have a dream’
speech (MP3 format). The encoded strings were synthesized
using an updated version of an Agilent Technologies synthesis process. For each
sequence, $1.2 \times 10^7$ copies were created, with roughly 1 base error per
500 bases, and sequenced on an Illumina HiSeq 2000 system.
After several postprocessing steps,
the original data was decoded with 100% accuracy.

The architecture of the Goldman et al. DNA-based encoding
system is illustrated in Figure 5.2.

Encoding example: We present next a short example of the
encoding algorithm introduced by Goldman et al. The text
to be encoded is “Birney and Goldman”.

• First, we apply base-3 Huffman coding to compress the
data. Encoding the characters B, i, r, n, e, y, (space), a, n, d,
(space), G, o, l, d, m, a, n in order results in

$S_1$ = 201002021010101000212000122211102
21201112000212210002212222212021100210122100110
210111200021.

Let n = len($S_1$) = 92, which equals 10102 in base
3. Hence, we set $S_2$ = 00000000000000010102 (an
encoding of length 20) and $S_3$ = 0000000000000 (a
padding of length 13 that makes the total length a multiple of 25). Therefore,

$S_4 = S_1 S_3 S_2$ = 201002021010101000212000122211102
21201112000212210002212222212021100210122100110
210111200021000000000000000000000000000010102,

of total length 92 + 13 + 20 = 125.

Applying differential coding to $S_4$ according to the table

previous   next trit
base       0   1   2
A          C   G   T
C          G   T   A
G          T   A   C
T          A   C   G

(with the base preceding the first trit taken to be A) results in an encoding of $S_4$ that reads as

$S_5$ = TAGTATATCGACTAGTACAGCGTAGCATCTCGCAGCGAGAT
ACGCTGCTACGCAGCATGCTGTGAGTATCGATGACGAGTGACTCT 
GTACAGTACGTACGTACGTACGTACGTACGTACGACTAT. 
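
The differential coding step can be sketched in a few lines. The lookup table is the one given above; taking A as the base preceding the first trit is inferred from the worked example (the first trit 2 is mapped to T). The function names are illustrative.

```python
# Differential code: the next base depends on the previous base and the
# current trit, so two consecutive bases are never equal.
NEXT_BASE = {
    "A": "CGT",  # previous A: trit 0 -> C, 1 -> G, 2 -> T
    "C": "GTA",
    "G": "TAC",
    "T": "ACG",
}


def trits_to_dna(trits, previous="A"):
    """Encode a string of trits; `previous` is the base before the first trit."""
    dna = []
    for trit in trits:
        previous = NEXT_BASE[previous][int(trit)]
        dna.append(previous)
    return "".join(dna)


def dna_to_trits(dna, previous="A"):
    """Invert trits_to_dna."""
    trits = []
    for base in dna:
        trits.append(str(NEXT_BASE[previous].index(base)))
        previous = base
    return "".join(trits)


prefix = "20100202101010100021"   # first 20 trits of the example string
print(trits_to_dna(prefix))       # first 20 bases of the encoded string
```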

Since len($S_5$) = 125, there are two DNA blocks $F_0$ and
$F_1$ of length 100 overlapping in exactly 75 bps, i.e.,

$F_0$ = TAGTATATCGACTAGTACAGCGTAGCATCTCGCAGCGAGAT

ACGCTGCTACGCAGCATGCTGTGAGTATCGATGACGAGTGACTCT 

GTACAGTACGTACG 

and 

$F_1$ = CATCTCGCAGCGAGATACGCTGCTACGCAGCATGCTGTGAG

TATCGATGACGAGTGACTCTGTACAGTACGTACGTACGTACGTAC 

GTACGTACGACTAT.

Figure 5.2. The Goldman et al. encoding method using ASCII and differential coding, Huffman compression, four-fold coverage, reverse complementation
of alternate data blocks and single parity-check coding. Panels: A, binary/text; B, base-3-encoded; C, DNA-encoded; D, DNA fragments (alternate fragments
have file information reverse complemented). Legend: encoded data; reverse complemented encoded data; file identification; intra-file location information;
parity-check; reversed or non-reversed flag.

Figure 5.3. The Grass et al. DNA text conversion, arraying (grouping) and encoding method.

Moreover, the odd-numbered DNA blocks are reverse
complemented so that

$F_1$ = ATAGTCGTACGTACGTACGTACGTACGTACGTACTGTACAG

AGTCACTCGTCATCGATACTCACAGCATGCTGCGTAGCAGCGTAT 

CTCGCTGCGAGATG. 
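
The segmentation and reverse complementation steps can be sketched as below. The sketch assumes, as in the example, that the string length minus 100 is a multiple of 25 (which the zero padding guarantees); the function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base and reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def overlapping_segments(dna, length=100, step=25):
    """Cut dna into windows of `length` bases shifted by `step` bases;
    odd-numbered windows are reverse complemented."""
    segments = []
    for index, start in enumerate(range(0, len(dna) - length + 1, step)):
        segment = dna[start:start + length]
        segments.append(reverse_complement(segment) if index % 2 else segment)
    return segments


print(reverse_complement("GACTAT"))  # last six bases of the example string
```

With a shift of 25 and a window of 100, interior bases are covered by four windows, which is the four-fold coverage mentioned in the text.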

• The file identification for the text equals 12. This gives
$ID_0 = ID_1 = 12$. The 12-trit intra-file location for
$F_0$ equals $intra_0$ = 000000000000 and for $F_1$, it equals
$intra_1$ = 000000000001. The parity check $P_i$ for block
$F_i$ is the sum of the trits at odd locations in $ID_i\,intra_i$
taken mod 3. Thus, $P_0 = P_1 = 1 + 0 + 0 + 0 + 0 + 0 + 0 = 1$.
By appending $IX_i = ID_i\,intra_i\,P_i$ to $F_i$ (using the same differential code) we get

$F_0'$ = TAGTATATCGACTAGTACAGCGTAGCATCTCGCAGCGAGA
TACGCTGCTACGCAGCATGCTGTGAGTATCGATGACGAGTGACT 
CTGTACAGTACGTACG AT ACGTACGTACGT C 

and 

$F_1'$ = ATAGTCGTACGTACGTACGTACGTACGTACGTACTGTACA
GAGTCACTCGTCATCGATACTCACAGCATGCTGCGTAGCAGCGT 
ATCTCGCTGCGAGATG AT ACGTACGTACGA G. 

• In the last step, we prepend A or T and append C or G
to each block; in this example, A and G are used for the even block
and T and C for the odd block. The resulting DNA
codewords are

$F_0''$ = A TAGTATATCGACTAGTACAGCGTAGCATCTCGCAGCGA
GATACGCTGCTACGCAGCATGCTGTGAGTATCGATGACGAGTGA 
CTCTGTACAGTACGTACGATACGTACGTACGTCG 

and 

$F_1''$ = T ATAGTCGTACGTACGTACGTACGTACGTACGTACTGTA
CAGAGTCACTCGTCATCGATACTCACAGCATGCTGCGTAGCAGC 
GTATCTCGCTGCGAGATGATACGTACGTACGAG C. 
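
The indexing and parity step of the example can be reproduced with the short sketch below. It assumes that the index trits are written with the same differential code as the data, continuing from the last base of the block, which is consistent with the strings shown above; the function names are illustrative.

```python
NEXT_BASE = {"A": "CGT", "C": "GTA", "G": "TAC", "T": "ACG"}


def parity_trit(index_trits):
    """Sum of the trits at odd (1-based) positions, taken mod 3."""
    return sum(int(trit) for trit in index_trits[0::2]) % 3


def append_index(block, file_id, location):
    """Append file ID, location and parity trits to a DNA block, continuing
    the differential code from the last base of the block."""
    index_trits = file_id + location
    trits = index_trits + str(parity_trit(index_trits))
    previous = block[-1]
    out = [block]
    for trit in trits:
        previous = NEXT_BASE[previous][int(trit)]
        out.append(previous)
    return "".join(out)


# Only the last four bases of each block are used here, for brevity.
print(append_index("TACG", "12", "000000000000"))
print(append_index("GATG", "12", "000000000001"))
```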

C. The Grass et al. Method

As is apparent from the previous exposition, the
Church-Gao-Kosuri and Goldman et al. methods did not implement
error-correction schemes that go beyond single parity-check
coding of fragments. Additional error-correction was
accomplished via four-fold coverage. Nevertheless, with the
relatively low synthesis and sequencing accuracies of the proposed
platforms, the lack of advanced error-correction solutions may
be a significant disadvantage. Furthermore, additional errors
may arise due to “aging” of the media, as there are no best
practices for physically storing the DNA strings to maximize
their stability over long periods of time.

Grass et al. addressed both these issues by
implementing a specialized error-correcting scheme and by outlining
best practices for DNA media maintenance. Their experiments
show that by combining these two approaches, one should
be able to store and recover information encoded in the
DNA from the Global Seed Vault (at -18 °C) for hundreds
of thousands of years.

The steps applied by Grass et al. for encoding text onto DNA include:


• Grouping: Every two characters are mapped to three
elements in $\mathbb{F}_{47}$, the finite field of size 47, via base
conversion from $256^2$ to $47^3$. This results in $B$
information arrays (blocks) of dimension $m \times k$, with
elements in $\mathbb{F}_{47}$. The information arrays are denoted
by $M_b$, with $b \in \{1, \ldots, B\}$ (see Figure 5.3 for the
notation and for an illustration of the grouping). Hence,
each block $M_b$ corresponds to a vector of length $k$, with
elements in $\mathbb{F}_{47^m}$.

• Outer Encoding: Each block $M_b$ is encoded using a
Reed-Solomon (RS) code over $\mathbb{F}_{47^m}$ to a codeword $C_b$ of
length $n$. This encoding procedure leads to blocks of size
$m \times n$. To uniquely identify each column in $C_b$, one
has to append $l$ elements in $\mathbb{F}_{47}$ to each column. This
produces vectors of length $K = l + m$
each.

• Inner Encoding: Each such vector is mapped to a vector
of length $N$ over $\mathbb{F}_{47}$ by using RS coding to obtain
the codewords $c_{b,i}$.

• Mapping to DNA Strings: Each element in $c_{b,i}$ is
converted to a DNA string of length 3 so that no
homopolymers of length three or longer appear. This process results
in DNA strings of length $3N$. To complete the mapping
and encoding, two fixed primers are attached to both
ends of each created DNA string and used for rapid
sequencing.
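
The grouping step in the list above relies on the fact that $256^2 = 65{,}536 \le 47^3 = 103{,}823$, so every pair of bytes fits into three symbols of the field of size 47. A minimal sketch of this base conversion follows; the digit order (most significant first) and the function names are assumptions made for illustration, and the Reed-Solomon stages are not shown.

```python
Q = 47  # field size used for the symbols


def bytes_to_gf47(pair):
    """Map two bytes (an integer below 256**2) to three base-47 digits."""
    value = pair[0] * 256 + pair[1]
    return (value // (Q * Q), (value // Q) % Q, value % Q)


def gf47_to_bytes(digits):
    """Invert bytes_to_gf47."""
    value = (digits[0] * Q + digits[1]) * Q + digits[2]
    return bytes(divmod(value, 256))


assert 256 ** 2 <= Q ** 3  # every byte pair has a distinct symbol triple
print(bytes_to_gf47(b"Hi"))
print(gf47_to_bytes(bytes_to_gf47(b"Hi")))
```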

To experimentally test their method, the authors started
with 83 KB of uncompressed text containing the Swiss Federal
Charter from 1291 and the English translation of the Methods
of Archimedes. This information was encoded into 4991 DNA
oligos of length 158. Each of the oligos comprised 117
“information” nucleotides. The sequences were synthesized
using the CustomArray electrochemical microarray technology
described in the previous sections, at a total price of 2,500
USD. In the process of information retrieval, custom PCR was
combined with sequencing on the Illumina MiSeq platform.

The individual decay rates of different DNA strands are 
mostly influenced by the storage temperature and the water 
concentration of the DNA storage environment. Four different 
dry storage technologies for DNA were tested: pure solid- 
state DNA, DNA on a Whatman FTA filter card, DNA on 
a biopolymeric storage matrix and DNA encapsulated in 
silica. Among the tested methods, DNA encapsulated in silica 
appears to offer the most durable storage format, as silica has 
the lowest water concentration and it separates DNA molecules 
from the environment through an inorganic layer. Therefore, 
the quality of preservation is not affected by environmental 
humidity, which is important since unlike low temperature 
(e.g. permafrost) and absence of light, humidity is relatively 
hard to control. DNA storage systems within silica substrates 
have the further advantages of exceptional stability against 
oxidation and photoresistance, provided that an additional 
titania layer is added to silica. 


6. Random Access and Rewritable DNA-Based 
Storage 

Although the archival techniques described in Section 5 provided a
number of solutions for DNA storage, they did not address
one important issue: accurate partial and random access to
data. In all the cited methods, one has to reconstruct the whole
text in order to read or retrieve the information encoded even
in a few bases, as the addressing methods used only allow
for determining the position of a read in a file, but cannot
ensure precise selection of reads of interest due to potential
undesired cross-hybridization between the primers and parts of
the information blocks. Moreover, all current designs support
read-only storage. Adapting the archival storage solutions to
address random access and rewriting appears complicated, due
to the storage format that involves reads of length 100 bps
shifted by 25 bps so as to ensure four-fold coverage of the
sequence. In order to rewrite one base, one needs to selectively
access and modify four consecutive reads.

The drawbacks of the archival architectures were addressed
by Yazdi et al., who introduced new coding-theoretic methods
to allow for rewriting and controlled random access.

A. The Yazdi et al. Method

To overcome the aforementioned issues, Yazdi et al.
developed a new, random-access and rewritable DNA-based
storage architecture based on DNA sequences endowed with
specialized address strings that may be used for selective
information access and encoding with inherent error-correction
capabilities. The addresses are designed to be mutually
uncorrelated, which means that for a set of addresses
$A = \{a_1, \ldots, a_M\}$, each of length $n$, and any two distinct addresses
$a_i, a_j \in A$, no prefix of $a_i$ of length $\le n - 1$ appears as a
proper suffix of $a_j$.

Information is encoded into DNA blocks of length
$L = 2n + m\ell$. The $i$th block, $B_i$, is flanked at both ends by two
unique addresses, one of which, say $a_i$, of length $n$, is used
for encoding. The remainder of the block is divided into $m$
sub-blocks $sub_{i,1}, \ldots, sub_{i,m}$, each of length $\ell$. Encoding of
the block $B_i$ is performed by first dividing the classical digital
information stream into $m$ non-overlapping segments and then
mapping them to integers $x_1, \ldots, x_m$, respectively. Then, each
$x_j$, for $1 \le j \le m$, is encoded into a DNA sub-block
$sub_{i,j}$ of length $\ell$ using an algorithm, named $\mathrm{ENCODE}_{a_i,\ell}(x_j)$,
introduced by Yazdi et al. and described in detail in the next section.
The algorithm represents an extension of prefix-synchronized
coding methods [75] (see Figure 6.1 for an illustration). Given
that the addresses in $A$ are chosen to be mutually uncorrelated
and at large Hamming distance from each other, no $a_j$ appears
as a subword in any DNA block, except at one flanking
end of the $j$th block. This feature enables highly sensitive
random access and accurate rewriting using the DNA editing
techniques described earlier.

To experimentally test their scheme, Yazdi et al. used
the introductory pages of five universities retrieved from
Wikipedia, amounting to a total size of 17 KB in ASCII format.
The text was encoded into 32 DNA blocks of length $L = 1000$
bps. To facilitate addressing, they constructed a set of 32 pairs
of mutually uncorrelated addresses and used 32 of the addresses for
encoding. The addresses used for encoding, $A = \{a_1, \ldots, a_{32}\}$,
were each of length $n = 20$ bps. Different words in the text
were counted and tabulated in a dictionary. Each word in the
dictionary was converted into a binary sequence of length 24.
Groups of six consecutive words in the file were
mapped to binary strings of length $6 \times 24 = 144$. The two
bits 11 were prepended to each binary
sequence of length 144 to shift the range of encoded values,
resulting in sequences of length 146 bits. The binary sequences
were then translated into DNA sub-blocks of length $\ell = 80$
bps using the ENCODE algorithm. Next, $m = 12$ sub-blocks of length
80 bps each were adjoined to form a DNA string of length
$12 \times 80 = 960$ bps. To complete the encoding, each string
of length 960 bps was equipped with two unique primers of
length 20 bps at its ends, forming a DNA block of length
$L = 20 + 960 + 20 = 1000$ bps.^1 The resulting DNA sequences
were synthesized by IDT [60] at the price of $149 per 1000
bps.
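
The dictionary-based packing described above can be illustrated with the following sketch. It only covers the step from words to 146-bit strings; the dictionary ordering (order of first occurrence), the sample words and the function names are assumptions made for illustration, and the subsequent ENCODE step is not shown.

```python
WORD_BITS = 24       # bits per dictionary entry, as in the text
WORDS_PER_GROUP = 6  # words packed into one sub-block


def pack_words(words, dictionary):
    """Return the 146-bit string for six words: '11' followed by 6 x 24 bits."""
    assert len(words) == WORDS_PER_GROUP
    payload = "".join(f"{dictionary[word]:0{WORD_BITS}b}" for word in words)
    return "11" + payload


def unpack_words(bits, dictionary):
    """Invert pack_words using the same dictionary."""
    inverse = {index: word for word, index in dictionary.items()}
    payload = bits[2:]
    return [inverse[int(payload[i:i + WORD_BITS], 2)]
            for i in range(0, len(payload), WORD_BITS)]


# Hypothetical six-word group; the dictionary indexes words by first occurrence.
text = "the university of the state of".split()
dictionary = {word: index for index, word in enumerate(dict.fromkeys(text))}
bits = pack_words(text, dictionary)
print(len(bits))               # 2 + 6 * 24 bits
print(2 * 20 + 12 * 80)        # block length: two addresses plus twelve sub-blocks
```

The integer `int(bits, 2)` would then play the role of the value $x_j$ that is passed to the ENCODE algorithm.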

To test the rewriting method, all 32 linear 1000 bps
fragments were mixed, and the information in three blocks was
rewritten in the DNA encoded domain using both gBlocks and
OEPCR editing techniques, described earlier. The
rewritten blocks were selected, amplified and Sanger sequenced to
verify that selection and rewriting were performed with 100%
accuracy.

Encoding example: We illustrate next the encoding and
decoding procedure of Yazdi et al. for the short address
string a = ACCTG, which can easily be verified to be
self-uncorrelated (i.e., no proper prefix of the sequence equals
a suffix of the sequence). For the sequence of integers
$G_{n,1}, G_{n,2}, \ldots, G_{n,7}$, the construction of which will be
described in detail in Section 6-D, one can verify that

$(G_{n,1}, G_{n,2}, \ldots, G_{n,7}) = (3, 9, 27, 81, 267, 849, 2715)$.

Here, $n$ denotes the length of the address string, which in this
case equals five. The algorithm $\mathrm{ENCODE}_{a,8}(550)$ produces

$550 = 0 \times G_{5,7} + 550$
$\Rightarrow \mathrm{ENCODE}_{a,8}(550) = \mathrm{C}\,\mathrm{ENCODE}_{a,7}(550)$,

$550 = 0 \times G_{5,6} + 550$
$\Rightarrow \mathrm{ENCODE}_{a,7}(550) = \mathrm{C}\,\mathrm{ENCODE}_{a,6}(550)$,

$550 = 2 \times G_{5,5} + 0 \times G_{5,4} + 16$
$\Rightarrow \mathrm{ENCODE}_{a,6}(550) = \mathrm{AA}\,\mathrm{ENCODE}_{a,4}(16)$,

$16 = 0 \times 3^3 + 1 \times 3^2 + 2 \times 3^1 + 1 \times 3^0$
$\Rightarrow \mathrm{ENCODE}_{a,4}(16) = \mathrm{ATCT}$,

$\Rightarrow \mathrm{ENCODE}_{a,8}(550) = \mathrm{CCAAATCT}$.
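
The arithmetic of the worked example can be checked with the short sketch below. It does not implement the ENCODE algorithm itself; it only verifies the decomposition of 550 and reproduces the final ternary block. The digit-to-base map (0 to A, 1 to T, 2 to C) is read off the example and is an assumption about the general scheme.

```python
# Values of G_{5,l} quoted in the text, indexed by l.
G5 = {4: 81, 5: 267, 6: 849, 7: 2715}
TERNARY_TO_BASE = "ATC"  # ASSUMPTION from the example: 0 -> A, 1 -> T, 2 -> C


def ternary_block(x, length):
    """Write x with `length` ternary digits and map each digit to a base."""
    bases = []
    for _ in range(length):
        x, digit = divmod(x, 3)
        bases.append(TERNARY_TO_BASE[digit])
    return "".join(reversed(bases))


# Decomposition used in the third step of the example.
remainder = 550 - 2 * G5[5] - 0 * G5[4]
print(remainder)                    # value passed to the length-4 encoder
print(ternary_block(remainder, 4))  # final four bases of the codeword
```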

When running $\mathrm{DECODE}_a(X)$ on the encoded output $X$ =
CCAAATCT, the following steps are executed:

$\mathrm{DECODE}_a(\mathrm{CCAAATCT}) = 0 \times G_{5,7} + \mathrm{DECODE}_a(\mathrm{CAAATCT})$,

$\mathrm{DECODE}_a(\mathrm{CAAATCT}) = 0 \times G_{5,6} + \mathrm{DECODE}_a(\mathrm{AAATCT})$,

$\mathrm{DECODE}_a(\mathrm{AAATCT}) = 2 \times G_{5,5} + 0 \times G_{5,4} + \mathrm{DECODE}_a(\mathrm{ATCT})$,

$\mathrm{DECODE}_a(\mathrm{ATCT}) = 16$,

$\Rightarrow \mathrm{DECODE}_a(\mathrm{CCAAATCT}) = 2 \times G_{5,5} + 16 = 550$.

Figure 6.1. Data format and encoding for the random access, rewritable
architecture of Yazdi et al. Each block of 1000 bps consists of 12 sub-blocks (#1, ..., #12) flanked by addresses of length 20 bps.

^1 Two different addresses were used to terminate one sequence because of
DNA synthesis issues, as having one long repeated string at both flanking ends
leads to undesired secondary structures.

B. Address Design and Constrained Coding 

To encode information on a DNA medium, Yazdi et al.
first designed a set $A$ of address sequences, each of length $n$,
that satisfies a number of constraints. These constraints make
the codewords suitable for selective random access; given the
address set $A$, they also constructed a code $\mathcal{C}_A(\ell)$ of length $\ell$
and provided efficient methods to encode and decode messages
to codewords in $\mathcal{C}_A(\ell)$. In their experiment, Yazdi et al. chose
$n = 20$ and $\ell = 80$ and stored twelve data subblocks of
length 80, each corresponding to a codeword in $\mathcal{C}_A(\ell)$, and
flanked these subblocks with two address sequences to obtain
a datablock of length 1000 bps.

In Section 6-C, we describe the design constraints for the
address sequences and relate these constraints to previously
studied concepts such as running digital sums and sequence
correlation. In Section 6-D, we describe the desired properties
of $\mathcal{C}_A(\ell)$ and present the encoding schemes developed by
Yazdi et al. based on prefix-synchronized schemes described
by Morita et al. [76].

C. Constrained Coding for Address Sequences 

Constrained coding serves two purposes in the design of 
address sequences. First, it ensures that DNA patterns prone 
to sequencing errors are avoided. Second, it allows DNA 
blocks to be accurately accessed, amplified and selected with¬ 
out perturbing other blocks in the DNA pool. We remark 
that while these constraints apply to address primer design, 
they indirectly govern the properties of the fully encoded 
DNA information blocks. Specifically, we require the address 
sequences to satisfy the following constraints: 

(Cl) Constant GC content (close to 50%) for all the pre¬ 
fixes of the sequences of sufficiently long length. DNA 


strands with 50% GC content are more stable than 
DNA strands with lower or higher GC content and have 
better coverage during sequencing. Since encoding user 
information is accomplished via prefix-synchronization, 
it is important to impose the GC content constraint on 
the addresses as well as their prefixes, as the latter 
requirement ensures that all fragments of encoded data 
blocks are balanced as well. Given D > 0, we define a 
sequence to be D-GC-prefix-halanced (D-GCPB) if for 
all prefixes (including the sequence itself), the difference 
between the number of G and C bases and the number of 
A and T bases is at most D. A set of address sequences 
is D-GCPB if all sequences in the set are D-GCPB. 
(C2) Large mutual Hamming distance. This reduces the prob¬ 
ability of erroneous address selection. Recall that the 
Hamming distance between two strings of equal length 
equals the number of positions at which the correspond¬ 
ing symbols disagree. Given d > 0, we design our set 
of sequences such that the Hamming distance between 
any pair of distinct sequences is at least d. 

(C3) Uncorrelatedness of the addresses. This imposes the 
restriction that prefixes of one address do not appear as 
suffixes of the same or another address. The motivation 
for this new constraint comes from the fact that addresses 
are used to provide unique identities for the blocks, 
and that their substrings should therefore not appear in 
“similar form” within other addresses. Here, “similarity” 
is assessed in terms of hybridization affinity. Furthermore,
long undesired prefix-suffix matches may lead
to assembly errors in blocks during joint sequencing. 
Most importantly, uncorrelated sequences may be jointly 
avoided via simple and efficient coding methods. Hence, 
one can ensure that address sequences only appear at 
the flanking ends of the blocks and nowhere else in the 
encoding. 

(C4) Absence of secondary (folding) structure for the address 
primers. Such structures may cause errors in the process 
of PCR amplification and fragment rewriting. 
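
Constraints C1 and C2 are easy to verify computationally. The following Python sketch is illustrative only: the function names and the toy addresses are assumptions introduced here, and the imbalance in C1 is interpreted as an absolute difference.

```python
from itertools import combinations

GC_BASES = set("GC")


def is_gc_prefix_balanced(seq: str, D: int) -> bool:
    """Return True if every prefix satisfies |#(G,C) - #(A,T)| <= D."""
    balance = 0
    for base in seq:
        balance += 1 if base in GC_BASES else -1
        if abs(balance) > D:
            return False
    return True


def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(x, y))


def min_distance(addresses) -> int:
    """Smallest pairwise Hamming distance in a set of addresses."""
    return min(hamming(x, y) for x, y in combinations(addresses, 2))


# Toy addresses (hypothetical, far shorter than real primers).
toy_addresses = ["GACT", "GTCA", "CAGT"]
all_balanced = all(is_gc_prefix_balanced(a, 1) for a in toy_addresses)
```

A candidate address set passes C1 and C2 for given D and d when `all_balanced` style checks succeed and `min_distance` returns at least d.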

As observed by Yazdi et al., constructing addresses that
simultaneously satisfy the constraints C1-C4 and determining
bounds on the largest number of such sequences are
prohibitively complex. To mitigate this problem, Yazdi et al.
used a semi-constructive address design approach, in which
balanced error-correcting codes are designed independently,
and subsequently expurgated so as to identify a large set
of mutually uncorrelated sequences. The resulting sequences
are subsequently tested for secondary structure using mfold
and Vienna [77].

In the same paper, Yazdi et al. observed that if one considers 
the constraints individually or one focuses on certain proper 
subsets of constraints, it is possible to construct families of 
codes whose sizes grow exponentially with code length. To
demonstrate this, Yazdi et al. borrowed concepts from other 
areas in coding theory. We provide an overview of these 
techniques in what follows. 

Running Digital Sums. An important criterion for selecting
block addresses is to ensure that the corresponding DNA
primer sequences have prefixes with a GC content approximately
equal to 50%, and that the sequences are at large
pairwise Hamming distance. Due to their applications in
optical storage, codes that address related issues have been
studied in a slightly different form under the name of bounded
running digital sum (BRDS) codes [78], [79]. A detailed
overview of this coding technique may be found in (7^. 

Fix an integer D > 0. A binary sequence a has a D-bounded 
running digital sum (D-BRDS) if for any prefix of a (including 
a itself), the number of zeroes and the number of ones differ 
by at most D. A set A of binary sequences is D-BRDS if 
all sequences in A have D-BRDS. A 1-BRDS set A with 
minimum distance 2d may be obtained from a binary code 
with distance d via the following theorem. 

Theorem 6.1 ( Thm 2]). If a binary unrestricted code 
of length n, size M and minimum distance d exists, then a 
1-BRDS set of length 2n and minimum distance 2d and size 
M exists. 
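
One standard way to realize the parameters of Theorem 6.1 is to replace every bit 0 by 01 and every bit 1 by 10. The Python sketch below assumes this doubling map for illustration (the cited proof may proceed differently); the function names and the small even-weight code are hypothetical.

```python
from itertools import combinations


def double_word(bits: str) -> str:
    """Map 0 -> 01 and 1 -> 10, doubling the length of the word."""
    return "".join("01" if b == "0" else "10" for b in bits)


def has_brds(bits: str, D: int) -> bool:
    """True if in every prefix the counts of ones and zeroes differ by at most D."""
    running = 0
    for b in bits:
        running += 1 if b == "1" else -1
        if abs(running) > D:
            return False
    return True


def min_distance(code) -> int:
    """Smallest pairwise Hamming distance of a code."""
    return min(sum(a != b for a, b in zip(x, y))
               for x, y in combinations(code, 2))


# Toy binary code: length n = 3, size M = 4, minimum distance d = 2.
code = ["000", "011", "101", "110"]
doubled = [double_word(c) for c in code]
```

Each differing bit of two codewords becomes two differing positions after doubling, and every doubled pair contributes one 0 and one 1, which is why the running sum never leaves {-1, 0, 1}.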


Hence, it follows from the Gilbert-Varshamov bound that
there exists a 1-BRDS set of length 2n and minimum distance
2d whose size is at least $2^n / \sum_{j=0}^{d-1} \binom{n}{j}$.
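
For a sense of scale, the Gilbert-Varshamov guarantee for binary codes, $A_2(n, d) \ge 2^n / \sum_{j=0}^{d-1} \binom{n}{j}$, can be evaluated directly. This is a small Python sketch; the function name is an assumption made here.

```python
from math import comb


def gv_lower_bound(n: int, d: int) -> int:
    """Gilbert-Varshamov lower bound on the size of a binary code of
    length n and minimum distance d, rounded up to an integer."""
    ball_volume = sum(comb(n, j) for j in range(d))
    return -(-(2 ** n) // ball_volume)  # ceiling division on integers
```

By Theorem 6.1, `gv_lower_bound(n, d)` is then also a lower bound on the size of a 1-BRDS set of length 2n and minimum distance 2d.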

A set of DNA sequences over {A,T,G,C} may then be 
constructed in a straightforward manner by mapping each 0 
into one of the bases {A, T} , and 1 into one of the bases {G, C}. 
In other words, a D-BRDS set of length n and size M yields a 
D-GCPB set of sequences of size M. For 0 < d < n, D > 0,
let $M_1(n, d; D)$ denote the maximum size of a D-GCPB set of
sequences of length n and minimum distance d. Furthermore,
for q > 0, let $A_q(n, d)$ denote the maximum size of a q-ary
code of length n with minimum distance d.


Applying Theorem 6.1 and the simple mapping above, we
have the following estimates for the size of codes satisfying
C1 and C2.

Theorem 6.2. Fix 0 < d < n, D = 1. Then

$$A_2(n/2, d/2) \le M_1(n, d; 1) \le A_4(n, d). \qquad (6.1)$$


Sequence Correlation 

We describe next the notion of autocorrelation of a sequence 
and introduce the related notion of mutual correlation of
sequences. It was shown in [80] that the autocorrelation function
is the crucial mathematical concept for studying sequences
avoiding forbidden words (strings) and subwords (substrings).
In order to accommodate the need for selective retrieval of 
a DNA block without accidentally selecting any undesirable 
blocks, we find it necessary to also introduce the notion of 
mutually uncorrelated sequences. 

Let X and Y be two words, possibly of different lengths,
over some alphabet of size q > 1. The correlation of X and
Y, denoted by $X \circ Y$, is a binary string of the same length
as X. The i-th bit (from the left) of $X \circ Y$ is determined by
placing Y under X so that the leftmost character of Y is under
the i-th character (from the left) of X, and checking whether
the characters in the overlapping segments of X and Y are
identical. If they are identical, the i-th bit of $X \circ Y$ is set to
1; otherwise, it is set to 0. For example, for X = GTAGTAG
and Y = TAGTAGCC, $X \circ Y$ = 0100100, as depicted below.

Note that in general, $X \circ Y \neq Y \circ X$, and that the two
correlation vectors may be of different lengths. In the example
above, we have $Y \circ X$ = 00000000. The autocorrelation of a
word X equals $X \circ X$. For X = GTAGTAG, $X \circ X$ = 1001001.


X = GTAGTAG
Y = TAGTAGCC         0
     TAGTAGCC        1
      TAGTAGCC       0
       TAGTAGCC      0
        TAGTAGCC     1
         TAGTAGCC    0
          TAGTAGCC   0

Definition 6.1. A sequence X is self-uncorrelated if
$X \circ X = 10 \ldots 0$. A set of sequences $\{X_1, X_2, \ldots, X_M\}$
is termed mutually uncorrelated if each sequence is
self-uncorrelated and if all pairs of distinct sequences satisfy
$X_i \circ X_j = 0 \ldots 0$ and $X_j \circ X_i = 0 \ldots 0$.
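
The correlation operation and Definition 6.1 translate directly into code. The following Python sketch uses function names chosen here for illustration; the two-address set at the end is a hypothetical toy example.

```python
def correlation(x: str, y: str) -> str:
    """Correlation of x and y as a bit string of length len(x).

    Bit i is 1 when y, placed with its first character under the i-th
    character of x, agrees with x on the whole overlapping segment.
    """
    bits = []
    for i in range(len(x)):
        overlap = min(len(x) - i, len(y))
        bits.append("1" if x[i:i + overlap] == y[:overlap] else "0")
    return "".join(bits)


def is_self_uncorrelated(x: str) -> bool:
    """True if the autocorrelation of x equals 10...0."""
    return correlation(x, x) == "1" + "0" * (len(x) - 1)


def is_mutually_uncorrelated(seqs) -> bool:
    """Check Definition 6.1 for a collection of distinct sequences."""
    if not all(is_self_uncorrelated(s) for s in seqs):
        return False
    return all("1" not in correlation(x, y)
               for x in seqs for y in seqs if x != y)


# Worked example from the text.
x, y = "GTAGTAG", "TAGTAGCC"
```

Calling `correlation(x, y)`, `correlation(y, x)` and `correlation(x, x)` reproduces the three correlation vectors of the worked example with X = GTAGTAG and Y = TAGTAGCC.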

The notion of mutual uncorrelatedness may be relaxed by 
requiring that only sufficiently long prefixes do not match
sufficiently long suffixes of other sequences. Sequences with this
property, and at sufficiently large Hamming distance, eliminate 
undesired address cross-hybridization during selection. 

Mutually uncorrelated codes were studied by many authors
under a variety of names. Levenshtein first introduced
them in 1964 under the name ‘strongly regular codes’ [81],
suggesting that the codes are interesting for synchronisation
applications. Inspired by the use of distributed sequences in
frame synchronisation applications by van Wijngaarden and
Willink [82], Bajic and Stojanovic [83] recently independently
rediscovered mutually uncorrelated codes using the
term ‘cross-bifix-free’ (see also [84]-[86] for recent papers and
the references therein). The maximum size of a mutually
uncorrelated code has been determined up to a constant factor
by Blackburn [86]. We state his result below.

Theorem 6.3. Let $M_2(n)$ be the maximum size of a set of
mutually uncorrelated sequences of length n. Then

$$\frac{3 \cdot 4^n}{4en} \lesssim M_2(n) \le \frac{4^n}{n}\left(1 - \frac{1}{n}\right)^{n-1}.$$

We point to an interesting construction by Bilotta et al.
[84] and provide a simple modification to obtain a set of
sequences satisfying C1 and C3. To do so, we introduce a
simple combinatorial object called a Dyck word. A Dyck word
is a binary string consisting of m zeroes and m ones such that
no prefix of the word has more zeroes than ones.

By definition, a Dyck word necessarily starts with a one and
ends with a zero. Consider a set V of Dyck words of length
2m and define the following set of words of length 2m + 1,

$$\mathcal{A} = \{1a : a \in V\}.$$

Bilotta et al. demonstrated that $\mathcal{A}$ is a mutually uncorrelated
set of sequences.
A Dyck word has height at most D if for any prefix of
the word, the difference between the number of ones and the
number of zeroes is at most D. In other words, a Dyck word
has height at most D if it has D-BRDS. Let Dyck(m, D)
denote the number of Dyck words of length 2m and height at
most D. de Bruijn et al. [87] proved that for fixed values of
D,

$$\mathrm{Dyck}(m, D) \sim \frac{4^m}{D+1} \tan^2\frac{\pi}{D+1} \cos^{2m}\frac{\pi}{D+1}. \qquad (6.2)$$

Here, $f(m) \sim g(m)$ means that $\lim_{m \to \infty} f(m)/g(m) = 1$.
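
For small parameters, Dyck(m, D) can also be computed exactly by dynamic programming over the current height. The following Python sketch is an illustration added here, with a hypothetical function name; it counts words directly from the definition and does not rely on the asymptotic formula.

```python
def count_dyck(m: int, D: int) -> int:
    """Number of Dyck words of length 2m whose height never exceeds D.

    ways[h] holds the number of valid prefixes that currently have
    h more ones than zeroes.
    """
    ways = [0] * (D + 1)
    ways[0] = 1
    for _ in range(2 * m):
        nxt = [0] * (D + 1)
        for h, count in enumerate(ways):
            if count == 0:
                continue
            if h + 1 <= D:       # append a one (step up)
                nxt[h + 1] += count
            if h >= 1:           # append a zero (step down)
                nxt[h - 1] += count
        ways = nxt
    return ways[0]
```

With D at least m the height constraint is inactive and the count reduces to the Catalan numbers, for example `count_dyck(4, 4)` gives 14.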

As with Bilotta et al., we observe that if we prepend Dyck 
words of length 2m and height at most D by 1, we obtain 
a mutually uncorrelated (D + 1)-BRDS set of binary words of
length 2m + 1. As before, we map 0 and 1 into {A,T} and
{C,G}, respectively, and obtain a mutually uncorrelated
(D + 1)-GCPB set of sequences.
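
The construction can be checked by brute force for small m and D. In the Python sketch below, the function names are assumptions, and the fixed choice 0 -> A, 1 -> G is just one of the admissible base assignments.

```python
def dyck_words(m: int, D: int):
    """All Dyck words of length 2m (1 = up, 0 = down) with height <= D."""
    words = []

    def extend(prefix: str, height: int) -> None:
        if len(prefix) == 2 * m:
            if height == 0:
                words.append(prefix)
            return
        if height + 1 <= D:
            extend(prefix + "1", height + 1)
        if height >= 1:
            extend(prefix + "0", height - 1)

    extend("", 0)
    return words


def has_brds(bits: str, D: int) -> bool:
    """True if in every prefix the counts of ones and zeroes differ by at most D."""
    running = 0
    for b in bits:
        running += 1 if b == "1" else -1
        if abs(running) > D:
            return False
    return True


def is_mutually_uncorrelated(words) -> bool:
    """No prefix of a word equals a suffix of the same or another word
    (the full word is excluded when comparing a word with itself)."""
    for x in words:
        for y in words:
            longest = len(x) - 1 if x == y else min(len(x), len(y))
            if any(x[-k:] == y[:k] for k in range(1, longest + 1)):
                return False
    return True


m, D = 3, 2
binary_addresses = ["1" + w for w in dyck_words(m, D)]
dna_addresses = [w.translate(str.maketrans("01", "AG"))
                 for w in binary_addresses]
```

For m = 3 and D = 2 this yields four binary words of length 7 that are mutually uncorrelated and 3-BRDS, in line with the statement above.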

Theorem 6.4. Let $M_3(n, D)$ be the maximum size of a
mutually uncorrelated D-GCPB set of sequences of length n.
If n is odd and D > 2, then

$$M_3(n, D) \gtrsim \frac{2^{n-1}}{D} \tan^2\frac{\pi}{D} \cos^{n-1}\frac{\pi}{D}.$$

As already pointed out, it is an open problem to determine
the largest number of address sequences that jointly satisfy the
constraints C1 to C4. We conjecture that the number of such
sequences is exponential in n, since the number of words that
satisfy C1+C2, C3, and C1+C3 separately is exponential (see
Theorems 6.2, 6.3 and 6.4). Furthermore, the number of words that
avoid secondary structures was also shown to be exponentially
large by Milenkovic and Kashyap [63].

D. Prefix-Synchronized DNA Codes 

Thus far, we described how to construct address sequences 
that may serve as unique identifiers of the blocks they are 
associated with. We also pointed out that once such address 
sequences are identified, user information has to be encoded so 
as to avoid the appearance of any of the addresses, sufficiently 
long substrings of the addresses, or substrings similar to the 
addresses in the resulting codewords. 

Specifically, for a fixed set $\mathcal{A}$ of address sequences of length
n, we define $\mathcal{C}_{\mathcal{A}}(\ell)$ to be the set of sequences of length
$\ell$ such that each sequence in $\mathcal{C}_{\mathcal{A}}(\ell)$ does not contain any string
belonging to $\mathcal{A}$. Therefore, by definition, when $\ell < n$, the set
$\mathcal{C}_{\mathcal{A}}(\ell)$ is simply the set of strings of length $\ell$. Our objective
is then to design an efficient encoding algorithm (one-to-one
mapping) to encode a set $\mathcal{I}$ of messages into $\mathcal{C}_{\mathcal{A}}(\ell)$. For the
sake of simplicity, we let $\mathcal{I} = \{0, 1, 2, \ldots, |\mathcal{I}| - 1\}$ and as is
usual with constrained coding, we hope to maximize $|\mathcal{I}|$.

Clearly, $|\mathcal{I}| \le |\mathcal{C}_{\mathcal{A}}(\ell)|$ and hence, it is of interest to
determine the size of $\mathcal{C}_{\mathcal{A}}(\ell)$. In the case when $\mathcal{A}$ is a set
of mutually uncorrelated strings, Yazdi et al. proved the
following theorem.


Theorem 6.5. Suppose that $\mathcal{A}$ is a set of M mutually
uncorrelated sequences of length n over the alphabet {A,T,C,G}.
Define $F(z) = \sum_{\ell \ge 0} |\mathcal{C}_{\mathcal{A}}(\ell)| z^{\ell}$. Then

$$F(z) = \frac{1}{1 - 4z + Mz^n}. \qquad (6.3)$$
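
Multiplying (6.3) by its denominator and comparing coefficients gives the recurrence $c_\ell = 4c_{\ell-1} - Mc_{\ell-n}$ with $c_0 = 1$ and $c_k = 0$ for $k < 0$, where $c_\ell$ denotes the number of length-$\ell$ sequences avoiding all addresses. The Python sketch below compares this recurrence with exhaustive enumeration; the function names and the toy address set {GGA, GCA} are assumptions made for illustration.

```python
from itertools import product


def count_by_recurrence(M: int, n: int, max_len: int):
    """Coefficients c_0..c_max_len of 1 / (1 - 4z + M z^n)."""
    counts = [1]
    for length in range(1, max_len + 1):
        value = 4 * counts[length - 1]
        if length >= n:
            value -= M * counts[length - n]
        counts.append(value)
    return counts


def count_by_brute_force(addresses, length: int) -> int:
    """Number of DNA strings of the given length containing no address."""
    total = 0
    for letters in product("ACGT", repeat=length):
        word = "".join(letters)
        if not any(a in word for a in addresses):
            total += 1
    return total


# Toy mutually uncorrelated set with M = 2 addresses of length n = 3.
toy_addresses = ["GGA", "GCA"]
predicted = count_by_recurrence(M=2, n=3, max_len=6)
```

For this toy set the recurrence gives 1, 4, 16, 62, 240, 928, 3588, which matches direct enumeration for lengths 0 through 6.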


We make certain observations on $|\mathcal{C}_{\mathcal{A}}(\ell)|$. When M is fixed,
it is easy to show that $F(z) = 1/(1 - 4z + Mz^n)$ has only
one pole with radius less than one for sufficiently large n.
Furthermore, if $R^{-1}$ is the pole of F, we can show that
$1/4 < R^{-1} < 1/(4 - \epsilon(n))$ with $\epsilon(n) = o(1)$. Here, the asymptotic is
computed with respect to n. In other words, for the case where
M is fixed, the size of $\mathcal{C}_{\mathcal{A}}(\ell)$ is at least $(4 - \epsilon(n))^{\ell}(1 - o(1))$
(here, the asymptotic is computed with respect to $\ell$).
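
The location of the dominant pole can be examined numerically. The Python sketch below finds the smallest positive root of $1 - 4z + Mz^n$ by bisection; the function name and the bracketing interval [1/4, 1/2] are choices made here, and the approach assumes $M < 2^n$ so that the denominator changes sign on that interval.

```python
def smallest_pole(M: int, n: int, iterations: int = 100) -> float:
    """Smallest positive root of 1 - 4z + M z^n, found by bisection."""
    if n < 2 or M >= 2 ** n:
        raise ValueError("sketch assumes n >= 2 and M < 2**n")

    def denominator(z: float) -> float:
        return 1 - 4 * z + M * z ** n

    lo, hi = 0.25, 0.5  # denominator is positive at lo and negative at hi
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if denominator(mid) > 0:
            lo = mid
        else:
            hi = mid
    return hi


pole = smallest_pole(M=32, n=20)
growth_rate = 1 / pole  # plays the role of 4 - epsilon(n)
```

With M = 32 and n = 20, the parameters of the experiment reported later in this section, the pole lies only marginally above 1/4, so the growth rate is very close to 4.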

In the case where $\mathcal{A}$ contains a single address a, Morita
et al. proposed efficient encoding schemes into $\mathcal{C}_{\mathcal{A}}(\ell)$ in
the context of prefix-synchronized codes [76]. Based on the
scheme of Morita et al., Yazdi et al. developed another
encoding method that encodes messages into $\mathcal{C}_{\mathcal{A}}(\ell)$ where
$\mathcal{A}$ contains more than one address. In this scheme, Yazdi et
al. assume that $\mathcal{A}$ is mutually uncorrelated and all sequences
in $\mathcal{A}$ end with the same base, which we assume without
loss of generality to be G. We then pick an address
$a = (a_1, a_2, \ldots, a_n) \in \mathcal{A}$ and define the following entities for
$1 \le i \le n - 1$,

$$\bar{A}_i = \{A, C, T\} \setminus \{a_i\},$$

$$a^{(i)} = (a_1, a_2, \ldots, a_i).$$

In addition, assume that the elements of $\bar{A}_i$ are arranged
in increasing order, say using the lexicographical ordering
$A \prec C \prec T$. We subsequently use $\bar{a}_{i,j}$ to denote the j-th
smallest element in $\bar{A}_i$, for $1 \le j \le |\bar{A}_i|$. For example, if
$\bar{A}_i = \{C, T\}$, then $\bar{a}_{i,1} = C$ and $\bar{a}_{i,2} = T$.

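Next, we define a sequence of integers $G_{n,1}, G_{n,2}, \ldots$ that
satisfies the following recursive formula

$$G_{n,\ell} = \begin{cases} 3^{\ell}, & 1 \le \ell < n, \\ \sum_{i=1}^{n-1} |\bar{A}_i| \, G_{n,\ell-i}, & \ell \ge n. \end{cases}$$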

For an integer $\ell > 0$ and $y < 3^{\ell}$, let $\theta_{\ell}(y) \in \{A, T, C\}^{\ell}$ be
a length-$\ell$ ternary representation of y. Conversely, for each
$W \in \{A, T, C\}^{\ell}$, let $\theta^{-1}(W)$ be the integer y such that
$\theta_{\ell}(y) = W$. We proceed to describe how to map every integer
in $\{0, 1, \ldots, G_{n,\ell} - 1\}$ into a sequence of length $\ell$ in $\mathcal{C}_{\mathcal{A}}(\ell)$
and vice versa. We denote these functions by ENCODE$_{a,\ell}$ and
DECODE$_a$, respectively.

The steps of the encoding and decoding procedures are
listed in Algorithm 1, and their correctness was
demonstrated by Yazdi et al.

Theorem 6.6. Let $\mathcal{A}$ be a set of mutually uncorrelated
sequences that end with the same base. Then ENCODE$_{a,\ell}$ is
a one-to-one map from $\{0, 1, \ldots, G_{n,\ell} - 1\}$ to $\mathcal{C}_{\mathcal{A}}(\ell)$ and for
all $x \in \{0, 1, \ldots, G_{n,\ell} - 1\}$, DECODE$_a$(ENCODE$_{a,\ell}(x)$) = x.
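
The following Python sketch illustrates the encoding and decoding maps for a single address. It is not the authors' implementation and rests on assumptions stated here: the helper names (`make_codec`, `theta`, `theta_inv`) are hypothetical; $\theta_\ell$ is taken to be the base-3 expansion with digits A, C, T in that order; $G_{n,\ell}$ is assumed to equal $3^\ell$ for $\ell < n$ and $\sum_{i=1}^{n-1} |\bar{A}_i| G_{n,\ell-i}$ otherwise; and the toy address AGCG (self-uncorrelated, ending in G) is chosen only to keep the example small.

```python
from functools import lru_cache

TERNARY = "ACT"  # assumed digit order A < C < T


def make_codec(address: str):
    """Build G, encode and decode for one self-uncorrelated address ending in G."""
    n = len(address)
    # comp[i-1] is the ordered set {A, C, T} minus {a_i}, for i = 1..n-1.
    comp = [[b for b in TERNARY if b != address[i]] for i in range(n - 1)]

    @lru_cache(maxsize=None)
    def G(length: int) -> int:
        if length < n:
            return 3 ** length
        return sum(len(comp[i - 1]) * G(length - i) for i in range(1, n))

    def theta(y: int, length: int) -> str:
        digits = []
        for _ in range(length):
            y, r = divmod(y, 3)
            digits.append(TERNARY[r])
        return "".join(reversed(digits))

    def theta_inv(word: str) -> int:
        y = 0
        for ch in word:
            y = 3 * y + TERNARY.index(ch)
        return y

    def encode(x: int, length: int) -> str:
        if not 0 <= x < G(length):
            raise ValueError("message out of range")
        if length < n:
            return theta(x, length)
        t, y = 1, x
        while y >= len(comp[t - 1]) * G(length - t):
            y -= len(comp[t - 1]) * G(length - t)
            t += 1
        c, d = divmod(y, G(length - t))
        # Address prefix of length t-1, one mismatching base, then recurse.
        return address[:t - 1] + comp[t - 1][c] + encode(d, length - t)

    def decode(word: str) -> int:
        length = len(word)
        if length < n:
            return theta_inv(word)
        t = 1
        while t <= n - 1 and word[t - 1] == address[t - 1]:
            t += 1
        if t > n - 1:
            raise ValueError("not a valid codeword")
        s = comp[t - 1].index(word[t - 1])  # zero-based, i.e. s - 1 in the text
        offset = sum(len(comp[i - 1]) * G(length - i) for i in range(1, t))
        return offset + s * G(length - t) + decode(word[t:])

    return G, encode, decode


G, encode, decode = make_codec("AGCG")
codewords = [encode(x, 6) for x in range(G(6))]
```

For the toy address, `G(6)` equals 861, and the 861 codewords of length 6 are distinct, avoid AGCG, and decode back to their messages.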

In their experiment, Yazdi et al. found a set $\mathcal{A}$ of M = 32
address sequences of length n = 20 and used this method
to encode information into $\mathcal{C}_{\mathcal{A}}(\ell = 80)$. In this instance, the
value of $G_{20,80} = 1.56 \times 10^{38}$ corresponds to more than
126 bits, while the size of $\mathcal{C}_{\mathcal{A}}(80)$ is $1.462 \times 10^{48}$, or more
than 159 bits.

The previously described ENCODE$_{a,\ell}(x)$ algorithm imposes
no limitations on the length of a prefix used for encoding.
This feature may lead to unwanted cross hybridization between
address primers used for selection and the prefixes of
addresses encoding the information. One approach to mitigate
this problem is to “perturb” long prefixes in the encoded
information in a controlled manner. For small-scale random
access/rewriting experiments, the recommended approach is to
first select all prefixes of length greater than some predefined
threshold. Afterwards, the first and last quarter of the bases
of these long prefixes are used unchanged while the central
portion of the prefix string is cyclically shifted by half of its
length.

For example, for the address primer a =
ACTAACTGTGCGACTGATGC, if the prefix $a^{(16)}$ =
ACTAACTGTGCGACTG appears as a subword, say p, in
X = ENCODE$_{a,\ell}(x)$ then X is modified to X' by mapping
p to p' = ACTAATGCCTGGACTG. This process of shifting is
illustrated below:

_p_ 

X =... ACTGTGCGACTGATGC ... 

cyclically shift by 3 

X' =... ACTGTACTGCGGATGC ... 

P' 

For an arbitrary choice of the addresses, this scheme may
not allow for unique decoding of ENCODE$_{a,\ell}(x)$. However, there
exist simple conditions that can be checked to eliminate 
primers that do not allow this transform to be “unique”. Given 
the address primers created for our random access/rewriting 
experiments, we were able to uniquely map each modified 
prefix to its original prefix and therefore uniquely decode the 
readouts. 

As a final remark, we would like to point out that prefix- 
synchronized coding also supports error detection and limited 
error-correction. Error-correction is achieved by checking if 
each substring of the sequence represents a prefix or “shifted” 
prefix of the given address sequence and making proper 
changes when needed. 

E. Error-Control Coding for DNA Storage 

Based on the discussion of error mechanisms in DNA
synthesis and sequencing, it is apparent that most errors fall
into the following categories:

• Substitution errors introduced during synthesis. These 

errors may be addressed using many classical coding 
schemes, such as Reed-Solomon and Low-Density Parity- 
Check coding methods |[7j. One non-trivial problem as¬ 
sociated with substitution errors introduced during the 
synthesis phase arises after high-throughput sequencing. 
In this case, errors in the synthesized sequences propagate 
through a number of reads produced during sequencing, 
and hence correspond to a previously unknown class 
of burst errors. The authors addressed this issue in a 
companion paper 0 0 where they introduced the 

notion of DNA profile codes, which have the property 
that they can correct combinations of sequencing and 
synthesis errors in reads, in addition to missing coverage 
(i.e., missing read errors). 

• Single deletion errors introduced during synthesis.
Isolated single deletion errors may be corrected by using
Levenshtein-Tenengolts codes fS^ , directly encoded into 



the DNA string. It also appears possible to extend the
DNA profile coding paradigm to encompass deletion and
insertion errors incurred during synthesis, although no
results in this direction were reported.

Figure 6.2. Impulse response of prototypical solid state nanopore sequencers.

• Substitution and coverage errors introduced during
sequencing. These errors may be handled in a similar
manner as substitution errors introduced during synthesis,
provided that they are used with the correct sequencing
platform (i.e., Illumina). For the third generation
sequencing platforms - PacBio and Oxford Nanopore 
- only one specialized error-correction procedure was 
reported so far |T0| , addressing problems arising due to 
overlapping impulse responses of two out of four bases 
(see Figure 6.2).

It remains an open problem to design codes that efficiently
combine all the constraints imposed by address design
considerations and at the same time provide robustness to both
synthesis and sequencing errors.

Appendix 

• Bases A, T, G and C: Nucleotides, the building units
of DNA, include one out of four possible bases, A 
(adenine), G (guanine), C (cytosine), and T (thymine). 
With a slight abuse of meaning, we alternatively use the 
terms nucleotides and bases, and express DNA sequence 
lengths in nucleotides or basepairs. 

• Capillary Electrophoresis: Capillary electrophoresis is
a technique that separates ions based on their
electrophoretic mobility, observed when applying a controlled
voltage.

• Clone: A section of DNA that has been inserted into a 
vector molecule, such as a plasmid, and then replicated 
to form many identical copies. 

• Coverage (of a sequencing experiment): The average
number of reads that contain a base at a particular
position in the DNA string to be sequenced.

• De novo: From scratch, without a template, anew.

• Deoxynucleotides: Components of DNA, containing the
phosphate, sugar and organic base; when in the triphosphate
form, they are the precursors required by DNA
polymerase for DNA synthesis (i.e., dATP, dCTP, dGTP,
dTTP).

• DNA microarray: A DNA microarray (also commonly 
known as DNA chip or biochip) is a collection of 
microscopic DNA spots containing relatively short DNA 
fragments termed probes, attached to a solid surface. 

• DNA Hybridization: DNA Hybridization is the process
of combining two complementary (in the Watson-Crick
sense) single-stranded DNA or RNA molecules and
allowing them to form a single double-stranded molecule
through base pairing.

Algorithm 1 Encoding and decoding

X = ENCODE$_{a,\ell}(x)$
begin
  if ($\ell \ge n$)
    $t \leftarrow 1$;
    $y \leftarrow x$;
    while ($y \ge |\bar{A}_t| \, G_{n,\ell-t}$)
      $y \leftarrow y - |\bar{A}_t| \, G_{n,\ell-t}$;
      $t \leftarrow t + 1$;
    end;
    $c \leftarrow \lfloor y / G_{n,\ell-t} \rfloor$;
    $d \leftarrow y \bmod G_{n,\ell-t}$;
    return $a^{(t-1)} \, \bar{a}_{t,c+1}$ ENCODE$_{a,\ell-t}(d)$;
  else
    return $\theta_{\ell}(x)$;
  end;
end;

x = DECODE$_a(X)$
begin
  $\ell \leftarrow$ length(X);
  $X = X_1 X_2 \ldots X_{\ell}$;
  if ($\ell < n$)
    return $\theta^{-1}(X)$;
  else
    find (s, t) such that $a^{(t-1)} \, \bar{a}_{t,s} = X_1 \ldots X_t$;
    return $\left(\sum_{i=1}^{t-1} |\bar{A}_i| \, G_{n,\ell-i}\right) + (s - 1) \, G_{n,\ell-t}$ + DECODE$_a(X_{t+1} \ldots X_{\ell})$;
  end;
end;

• Dye-terminators: Labeled versions of dideoxyribonucleotide
triphosphates (ddNTPs), “defective” nucleotides
used in Sanger sequencing.

• Enzyme: Enzymes are biological molecules (proteins) 
that accelerate, or catalyze, chemical reactions. 

• Heteroduplex: A heteroduplex is a double-stranded (duplex)
molecule of nucleic acid originated through the
genetic recombination of single complementary strands
derived from different sources, such as from different
homologous chromosomes or even from different organisms.

• Homologs: Two chromosomes or fragments from
chromosomes from a particular pair, containing the same
genetic loci in the same order.

• Homopolymers: Sequences of identical bases in DNA 
strings. 

• In vivo recombination: Recombination is the process of 
combining genetic (DNA) material from multiple sources 
to create new sequences. In vivo recombination refers to 
recombination performed inside a living cell (in vivo). 

• Ligase: An enzyme that catalyzes the process of joining 
two molecules through the formation of new chemical 
bonds. 

• Luciferase: An oxidative enzyme used to provide
luminescence in natural or controlled biological environments.

• Oligonucleotide (short strand of nucleotides): A relatively
short sequence of nucleotides, usually synthesized
to match a region where a mutation is known to occur.

• Polymerase chain reaction (PCR): Polymerase chain 
reaction (PCR) is a laboratory technique used to amplify 
DNA sequences. The method involves using short DNA 
sequences called primers to select the portion of the 
genome to be amplified. The temperature of the sample is 
repeatedly raised and lowered to help a DNA replication 
enzyme copy the target DNA sequence. The technique 
can produce a billion copies of the target sequence in 
just a few hours. 

• Primer: A primer is a strand of short nucleic acid
sequences that serves as a starting point for DNA synthesis.



Figure A.1. Principles of DNA denaturation and hybridization.


• Protein: Proteins are large biological molecules, or 
macromolecules, consisting of one or more long chains 
of amino acid residues. 

• Read: DNA fragment created during the sequencing 
process. 

• Sequence assembly: Sequence assembly refers to aligning
and merging fragments of a much longer DNA
sequence in order to reconstruct the original sequence.

• Symmetric dimer: A chemical structure formed from 
two symmetric units. 